PROBABILISTIC DATA
                  STRUCTURES
                                Jim Duey
                             Lonocloud.com
                                @jimduey
                            http://clojure.net




Wednesday, August 1, 2012
WHAT IS A DATA STRUCTURE?



              It is a ‘structure’ that holds ‘data’, allowing you to extract
              information.

              Data gets added to the structure.

              Queries of various sorts are used to extract information.




INSPIRATION



              Ilya Katsov

              https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/




WORD OF CAUTION


              Many probabilistic data structures use hashing

              Java’s hashCode is not safe across multiple processes

              “Java's hashCode is not safe for distributed systems”

              http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html
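
              One way to sidestep the problem is to hash a value's serialized bytes with a fixed, well-specified function instead of calling hashCode(). A minimal sketch, assuming string keys encoded as UTF-8 and SHA-256 from the JDK; the class and method names are hypothetical and not from the talk:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: the hash depends only on the bytes of the key,
// so every process that sees the same key computes the same value.
final class StableHash {
    static long hash64(String key) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(key.getBytes(StandardCharsets.UTF_8));
            // Use the first 8 bytes of the digest as a 64-bit hash.
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Map a key to a bucket in [0, m).
    static int bucket(String key, int m) {
        return (int) Math.floorMod(hash64(key), (long) m);
    }
}
```

              A cryptographic digest is slower than a purpose-built non-cryptographic hash; it is used here only because it ships with the JDK.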




PROBABILISTIC


              Query may return a wrong answer

              The answer is ‘good enough’

              Uses a fraction of the resources, e.g. memory or CPU cycles




HOW MANY ITEMS?


              If you have a large collection of ‘things’ ...

              And there are some duplicates ...

              And you want to know how many unique things there are.




LINEAR COUNTING
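     import java.util.BitSet;

     class LinearCounter {
        final int m;                       // m is a design parameter
        final BitSet mask;

        LinearCounter(int m) { this.m = m; mask = new BitSet(m); }

        void add(Object value) {
            // get an index for value in 0 .. m - 1 (floorMod is never negative)
            int position = Math.floorMod(value.hashCode(), m);
            mask.set(position);
        }
     }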


LINEAR COUNTING
              [Diagram: Thing 1 through Thing 4 are each passed to add(), which sets a bit in a 12-bit mask, shown as 1 0 0 0 1 0 0 0 0 0 1 0]

LINEAR COUNTING
    class LinearCounter {
        BitSet mask = new BitSet(m); // m is a design parameter
      ...
    }
      Question: How big is m?




LINEAR COUNTING
              Load factor = n / m
                    n = number of unique items expected
                    m = size of the bit mask

              If the load factor is < 1: few collisions, so the number of
              bits set is close to the cardinality.




LINEAR COUNTING
              Load factor = n / m
                    n = number of unique items expected
                    m = size of the bit mask

              If the load factor is very high (say, 100): all bits are set, so
              the mask carries no information about cardinality.




LINEAR COUNTING
              Load factor = n / m
                    n = number of unique items expected
                    m = size of the bit mask

              If the load factor is higher than 1, but not too high:
              many collisions, but some relationship might exist
              between the number of bits set and the cardinality.



LINEAR COUNTING

              Finding the number of members of the collection


         n = - m * ln ((m - w) / m)
         n is the estimated number of unique items (the cardinality)
         m is the size of the bit map
         w is the number of 1s in the bit map
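
         The estimate can be computed directly from a java.util.BitSet. A minimal sketch of the formula; the class and method names are hypothetical:

```java
import java.util.BitSet;

final class LinearCountingEstimate {
    // n = -m * ln((m - w) / m), where w is the number of bits set in the mask.
    static double estimate(BitSet mask, int m) {
        int w = mask.cardinality();
        // When every bit is set (w == m) the result is +Infinity:
        // the mask is saturated and says nothing about n.
        return -m * Math.log((m - w) / (double) m);
    }
}
```

         With m = 1000 and w = 100 the estimate is about 105, slightly above the raw bit count because the formula compensates for expected collisions.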




LINEAR COUNTING
    class LinearCounter {
        BitSet mask = new BitSet(m); // m is a design parameter
      ...
    }
      Question: How big is m?
      m > max(5, 1 / (std-err * n / m) ** 2) * (e ** (n / m) - n / m - 1)


      For a standard error of 0.01:
      On the order of 1M unique values: m = 154 Kbit, n/m = 6.5
      On the order of 10M unique values: m = 1.1 Mbit, n/m = 9
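
      The bound is easy to evaluate for a chosen load factor. A small sketch, writing t for n / m; the class and method names are hypothetical:

```java
final class LinearCounterSizing {
    // Lower bound on m for load factor t = n / m and a target standard error:
    // m > max(5, 1 / (stdErr * t)^2) * (e^t - t - 1)
    static double minMaskBits(double t, double stdErr) {
        double a = Math.max(5.0, 1.0 / Math.pow(stdErr * t, 2));
        return a * (Math.exp(t) - t - 1.0);
    }
}
```

      For a standard error of 0.01 this gives roughly 156 Kbit at t = 6.5 and roughly 1.0 Mbit at t = 9, in line with the figures on the slide.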

LINEAR COUNTING


              “Linear-Time Probabilistic Counting Algorithm for
              Database Applications”

              Use the table in the paper to find the bit map size.

              Check out Ilya’s blog post for some nice graphs.




LINEAR COUNTING
              [Diagram: Thing 1 through Thing 4 beside the 12-bit mask 1 0 0 0 1 0 0 0 0 0 1 0]

              [Diagram: Thing 1 through Thing 4 beside the 12-bit mask, now with more bits set: 1 0 1 0 1 1 0 0 1 0 1 1]

BLOOM FILTER



              If you have a large collection of ‘things’ ...

              And you want to know if some thing is in the collection.




BLOOM FILTER
              [Diagram: Thing 1 through Thing 4 stored in the 12-bit mask 1 0 1 0 1 1 0 0 1 0 1 1]

BLOOM FILTER
              [Diagram: a lookup of ‘Other thing’ against the same 12-bit mask, 1 0 1 0 1 1 0 0 1 0 1 1]

BLOOM FILTER
              [Diagram: a lookup of ‘Missing thing’ against the same 12-bit mask, 1 0 1 0 1 1 0 0 1 0 1 1]
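
              The add and lookup steps pictured in these diagrams can be sketched as follows. This is a minimal illustration, not the talk's code: deriving the k indexes by double hashing from hashCode() is an assumption of this example, and hashCode() is used only for brevity despite the earlier caution.

```java
import java.util.BitSet;

final class BloomFilterSketch {
    private final int m;      // number of bits in the filter
    private final int k;      // number of hash functions
    private final BitSet bits;

    BloomFilterSketch(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // i-th index for a value: h1 + i * h2, reduced into [0, m).
    private int index(Object value, int i) {
        int h1 = value.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // cheap second hash
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(Object value) {
        for (int i = 0; i < k; i++) bits.set(index(value, i));
    }

    // false: definitely never added.
    // true: probably added, but it may be a false positive.
    boolean mightContain(Object value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(value, i))) return false;
        }
        return true;
    }
}
```

              A lookup that finds any of its k bits unset can answer "no" with certainty; a lookup that finds all k bits set can only answer "probably".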

BLOOM FILTER
              How big to make ‘m’ and ‘k’?

                    ‘m’ is the number of bits in the filter

                    ‘k’ is the number of separate hash functions

     m = - (n * ln p) / (ln 2) ** 2


     n is the number of distinct items to be stored
     p is the probability of a false positive
     For n = 1M and p = 0.01:
     m = - (1M * ln 0.01) / (ln 2) ** 2 = 9.6 Mbits = 1.2 MB

BLOOM FILTER
              How big to make ‘m’ and ‘k’?

                    ‘m’ is the number of bits in the filter

                    ‘k’ is the number of separate hash functions

     m = - (n * ln p) / (ln 2) ** 2
     k = (m / n) * ln 2


     k = (9.6M / 1M) * 0.69 = 6.64, rounded up to 7 hash functions
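
     Both formulas translate directly into code. A small sketch with hypothetical class and method names; rounding k to the nearest integer is a choice made here:

```java
final class BloomSizing {
    // m = -(n * ln p) / (ln 2)^2, rounded up to a whole number of bits.
    static long bits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-(n * Math.log(p)) / (ln2 * ln2));
    }

    // k = (m / n) * ln 2, rounded to the nearest whole number of hash functions.
    static int hashes(long m, long n) {
        return (int) Math.round((double) m / n * Math.log(2));
    }
}
```

     For n = 1,000,000 and p = 0.01 this yields about 9.6 million bits and k = 7, matching the worked numbers above.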



BLOOM FILTER

              You can’t query a Bloom filter for cardinality

              You can’t remove an item once it’s been added

              There are many variants of the Bloom filter, some of which
              address these issues




HASH FUNCTIONS
              How to find many hash functions?

                    “Out of one, many”

                    Make the size of your bit mask a power of 2

                    By masking off bit fields, you can get multiple hash values
                    from a single hash function.

                    A 16-bit hash will cover a 64 Kbit (65,536-position) index

                    A 512-bit hash will give 32 16-bit hashes
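
                    The splitting step can be sketched as follows. SHA-512 is an assumption of this example (the slide only says "512 bit hash"), and the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SplitHash {
    // Split one 512-bit digest into 32 indexes of 16 bits each.
    static int[] indexes(String key) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-512")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            int[] out = new int[d.length / 2];   // 64 bytes -> 32 values
            for (int i = 0; i < out.length; i++) {
                // Combine two bytes into one value in [0, 65535].
                out[i] = ((d[2 * i] & 0xFF) << 8) | (d[2 * i + 1] & 0xFF);
            }
            return out;
        } catch (NoSuchAlgorithmException e) {
            // Assumes a JDK whose default providers include SHA-512.
            throw new IllegalStateException(e);
        }
    }
}
```

                    Each value indexes a 65,536-bit mask directly, which is why the mask size is chosen as a power of 2.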

COUNT-MIN SKETCH
              When you want to know how many of each item there are in a
              collection.




COUNT-MIN SKETCH
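      [Diagram: a grid of counters, w columns wide and d rows deep. Adding Thing 1 increments (+1) one counter in each of the four rows.]
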

      Each box is a counter.
      Each row is indexed by a corresponding hash function.

COUNT-MIN SKETCH
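      [Diagram: the same grid, w columns wide and d rows deep. ‘Some thing’ maps to one counter in each of the four rows; the counters are labeled a, b, c and d.]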

      Estimated frequency for ‘Some thing’ is min(a, b, c, d).
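
      The update and query steps can be sketched as follows. This is a minimal illustration, not the talk's code: the per-row hash mixing is an assumption of this example, and any d independent hash functions would do.

```java
final class CountMinSketch {
    private final int w;            // counters per row
    private final int d;            // number of rows (hash functions)
    private final long[][] counts;

    CountMinSketch(int w, int d) {
        this.w = w;
        this.d = d;
        this.counts = new long[d][w];
    }

    // Column for a value in a given row, derived from hashCode and a per-row seed.
    private int column(Object value, int row) {
        int h = value.hashCode() ^ (row * 0x9E3779B9);
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return Math.floorMod(h, w);
    }

    // Increment one counter in every row.
    void add(Object value) {
        for (int row = 0; row < d; row++) counts[row][column(value, row)]++;
    }

    // Estimated frequency: the minimum counter across all rows.
    // Collisions only ever inflate counters, so this never undercounts.
    long estimate(Object value) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < d; row++) {
            min = Math.min(min, counts[row][column(value, row)]);
        }
        return min;
    }
}
```

      Taking the minimum picks the row where the item suffered the fewest collisions, so the estimate is always at least the true count and at most the total number of items added.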


COUNT-MIN SKETCH
              How big to make ‘w’ and ‘d’?

                    ‘w’ is the number of counters per hash function;
                    it limits the magnitude of the error

                    ‘d’ is the number of separate hash functions;
                    it controls the probability that the estimate exceeds
                    the error limit




COUNT-MIN SKETCH

    error-limit <= 2 * n / w
    probability estimate stays within the limit >= 1 - (1 / 2) ** d
    probability limit exceeded <= (1 / 2) ** d


    n = total number of items counted
    w = number of counters per hash function
    d = number of separate hash functions


         Works best on skewed data.
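
         Turning the bounds around gives w and d for a target error. A small sketch with hypothetical names, reading the bound as: the overcount exceeds 2 * n / w with probability at most (1 / 2) ** d.

```java
final class CountMinSizing {
    // Smallest w such that 2 * n / w <= errorLimit.
    static long width(long n, double errorLimit) {
        return (long) Math.ceil(2.0 * n / errorLimit);
    }

    // Smallest d such that (1 / 2)^d <= failureProbability.
    static int depth(double failureProbability) {
        return (int) Math.ceil(Math.log(1.0 / failureProbability) / Math.log(2.0));
    }
}
```

         For example, counting n = 1,000,000 items with an error limit of 1,000 and a 1% chance of exceeding it needs w = 2,000 counters per row and d = 7 rows.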



RESOURCES
              https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

              http://blog.aggregateknowledge.com/

               http://lkozma.net/blog/sketching-data-structures/

              https://sites.google.com/site/countminsketch/home

              “PyCon 2011: Handling ridiculous amounts of data with
              probabilistic data structures”



A Survey of Probabilistic Data Structures - StampedeCon 2012

  • 1. PROBABILISTIC DATA STRUCTURES Jim Duey Lonocloud.com @jimduey http://clojure.net Wednesday, August 1, 2012
  • 2. WHAT IS A DATA STRUCTURE? It is a ‘structure’ that holds ‘data’, allowing you to extract information. Data gets added to the structure. Queries of various sorts are used to extract information.
  • 3. INSPIRATION Ilya Katsov: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
  • 4. WORD OF CAUTION Many probabilistic data structures use hashing. Java’s hashCode is not safe across multiple processes. See “Java's hashCode is not safe for distributed systems”: http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html
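One way to sidestep the caution above, sketched under assumptions that are not part of the talk: hash the UTF-8 bytes of a value with an algorithm whose output is fully specified, so every process computes the same number. The class name `StableHash` and its methods are hypothetical, and SHA-256 is only one possible choice; a faster non-cryptographic hash would serve equally well if its output is defined independently of the JVM.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not from the slides.
final class StableHash {
    private StableHash() {}

    /** 64-bit hash: the first 8 bytes of the SHA-256 digest of the UTF-8 bytes. */
    static long hash64(String value) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(value.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /** Maps a value to an index in 0 .. m-1, independent of the running JVM. */
    static int index(String value, int m) {
        return (int) Math.floorMod(hash64(value), (long) m);
    }
}
```

Because the input is a byte sequence rather than an object identity, two machines that call `StableHash.index("Thing 1", m)` with the same `m` set the same bit.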
  • 5. PROBABILISTIC A query may return a wrong answer. The answer is ‘good enough’. Uses a fraction of the resources, i.e. memory or CPU cycles.
  • 6. HOW MANY ITEMS? If you have a large collection of ‘things’ ... And there are some duplicates ... And you want to know how many unique things there are.
  • 7. LINEAR COUNTING
        import java.util.BitSet;

        class LinearCounter {
            BitSet mask = new BitSet(m); // m is a design parameter

            void add(Object value) {
                // get an index for value in 0 .. m-1; floorMod avoids the
                // negative index that % produces for a negative hashCode()
                int position = Math.floorMod(value.hashCode(), m);

                mask.set(position);
            }
        }
  • 8. LINEAR COUNTING [Figure: add() hashes Thing 1 through Thing 4 to positions in a 12-bit mask, shown as 1 0 0 0 1 0 0 0 0 0 1 0.]
  • 9. LINEAR COUNTING
        class LinearCounter {
            BitSet mask = new BitSet(m); // m is a design parameter
            ...
        }
    Question: How big is m?
  • 10. LINEAR COUNTING Load factor = n / m, where n is the number of unique items expected and m is the size of the bit mask. If the load factor is well below 1, there are few collisions and the number of bits set is close to the cardinality.
  • 11. LINEAR COUNTING Load factor = n / m. If the load factor is very high (say 100), all bits are set and the mask carries no information about cardinality.
  • 12. LINEAR COUNTING Load factor = n / m. If the load factor is higher than 1, but not too high, there are many collisions, but some relationship might still exist between the number of bits set and the cardinality.
  • 13. LINEAR COUNTING Finding the number of members of the collection: n = - m * ln((m - w) / m), where m is the size of the bit map and w is the number of 1s in the bit map (the bit map's own cardinality, not that of the collection).
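Putting the add() step and the estimate formula together gives a minimal sketch like the one below. The class name `LinearCounterSketch`, the constructor and the `estimate()` method are assumptions added for illustration; only `add` and the formula come from the slides. It keeps `hashCode()` as the slide does, so the earlier caution about hashing across processes still applies.

```java
import java.util.BitSet;

// Illustrative completion of the slide's LinearCounter; names are hypothetical.
final class LinearCounterSketch {
    private final int m;       // size of the bit mask (design parameter)
    private final BitSet mask;

    LinearCounterSketch(int m) {
        this.m = m;
        this.mask = new BitSet(m);
    }

    void add(Object value) {
        // floorMod keeps the index in 0 .. m-1 even for negative hash codes
        mask.set(Math.floorMod(value.hashCode(), m));
    }

    /** n = -m * ln((m - w) / m), where w is the number of bits set. */
    double estimate() {
        int w = mask.cardinality();
        if (w == m) {
            // Every bit is set: the mask no longer says anything about n.
            return Double.POSITIVE_INFINITY;
        }
        return -m * Math.log((m - w) / (double) m);
    }
}
```

Adding the same value twice sets the same bit, so duplicates do not change the estimate; that is what makes the structure count unique items.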
  • 14. LINEAR COUNTING
        class LinearCounter {
            BitSet mask = new BitSet(m); // m is a design parameter
            ...
        }
    Question: How big is m?
    m > max(5, 1 / (std-err * n / m) ** 2) * (e ** (n / m) - n / m - 1)
    For a standard error of 0.01: on the order of 1M unique values, m = 154 Kbit, n/m = 6.5; on the order of 10M unique values, m = 1.1 Mbit, n/m = 9.
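Since m appears on both sides of the inequality, it cannot be solved for m directly; one option is to test candidate sizes. The sketch below does exactly that and nothing more. The class and method names are hypothetical, and `stdErr` stands for the slide's std-err.

```java
// Hypothetical helper that evaluates the sizing inequality from the slide.
final class LinearCounterSizing {
    private LinearCounterSizing() {}

    /** Right-hand side: max(5, 1 / (stdErr * t) ** 2) * (e ** t - t - 1), with t = n / m. */
    static double requiredBits(double n, double m, double stdErr) {
        double t = n / m; // load factor
        double a = Math.max(5.0, 1.0 / Math.pow(stdErr * t, 2));
        return a * (Math.exp(t) - t - 1.0);
    }

    /** True when a mask of m bits satisfies the inequality for n unique items. */
    static boolean isLargeEnough(double n, double m, double stdErr) {
        return m > requiredBits(n, m, stdErr);
    }
}
```

A caller could step m upward from a small value until `isLargeEnough` first returns true, or simply look the size up in the table from the paper cited on the next slide.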
  • 15. LINEAR COUNTING Source: “Linear-Time Probabilistic Counting Algorithm for Database Applications”. Use the table to find the bit map size. Check out Ilya’s blog post for some nice graphs.
  • 16. LINEAR COUNTING [Figure: Thing 1 through Thing 4 beside the 12-bit mask 1 0 0 0 1 0 0 0 0 0 1 0.]
  • 17. [Figure: Thing 1 through Thing 4 beside a mask with more bits set: 1 0 1 0 1 1 0 0 1 0 1 1.]
  • 18. BLOOM FILTER If you have a large collection of ‘things’ ... And you want to know if some thing is in the collection.
  • 19. BLOOM FILTER [Figure: Thing 1 through Thing 4 beside the filter bits 1 0 1 0 1 1 0 0 1 0 1 1.]
  • 20. BLOOM FILTER [Figure: ‘Other thing’ checked against the filter bits 1 0 1 0 1 1 0 0 1 0 1 1.]
  • 21. BLOOM FILTER [Figure: ‘Missing thing’ checked against the filter bits 1 0 1 0 1 1 0 0 1 0 1 1.]
  • 22. BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter; ‘k’ is the number of separate hash functions. m = - (n * ln p) / (ln 2) ** 2, where n is the number of distinct items to be stored and p is the probability of a false positive. For n = 1M and p = 0.01: m = - (1M * ln 0.01) / (ln 2) ** 2 = 9.6 Mbits = 1.2 MB.
  • 23. BLOOM FILTER How big to make ‘m’ and ‘k’? ‘m’ is the number of bits in the filter; ‘k’ is the number of separate hash functions. m = - (n * ln p) / (ln 2) ** 2 and k = (m / n) * ln 2. With m = 9.6 Mbits and n = 1M: k = (9.6M / 1M) * 0.69 ≈ 6.64, rounded up to 7 hash functions.
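The two sizing formulas translate almost directly into code. This is a small sketch with hypothetical names (`BloomSizing`, `optimalBits`, `optimalHashes`); rounding m up and rounding k to the nearest integer are choices made here, not something the slides specify.

```java
// Hypothetical helper for the Bloom filter sizing formulas.
final class BloomSizing {
    private BloomSizing() {}

    /** m = -(n * ln p) / (ln 2) ** 2, rounded up to a whole number of bits. */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-(n * Math.log(p)) / (ln2 * ln2));
    }

    /** k = (m / n) * ln 2, rounded to the nearest integer, at least 1. */
    static int optimalHashes(long m, long n) {
        double k = (double) m / n * Math.log(2);
        return (int) Math.max(1L, Math.round(k));
    }
}
```

For n = 1,000,000 and p = 0.01 this reproduces the figures on the slides: roughly 9.6 million bits and 7 hash functions.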
  • 24. BLOOM FILTER You can’t query a Bloom filter for cardinality. You can’t remove an item once it’s been added. There are many variants of the Bloom filter, some of which address these issues.
  • 25. HASH FUNCTIONS How to find many hash functions? “Out of one, many”. Make the size of your bit mask a power of 2. By masking off bit fields, you can get multiple hash values from a single hash function. A 16-bit hash covers a 64 Kbit (65,536-bit) index; a 512-bit hash gives 32 16-bit hashes.
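The bit-field idea can be sketched as a small Bloom filter with a 65,536-bit mask whose k indexes are consecutive 16-bit fields of one SHA-512 digest. Everything named here (`MaskedBloomFilter`, `mightContain`, the use of SHA-512 and of String keys) is an assumption for illustration; the slides only describe the masking trick. It also assumes the JDK in use ships a SHA-512 provider, as OpenJDK does.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

// Hypothetical Bloom filter built on 16-bit fields of a single 512-bit hash.
final class MaskedBloomFilter {
    private static final int BITS = 1 << 16; // power of 2: 65,536 bits
    private final BitSet mask = new BitSet(BITS);
    private final int k;                     // number of 16-bit fields used

    MaskedBloomFilter(int k) {
        if (k < 1 || k > 32) {
            throw new IllegalArgumentException("k must be in 1..32");
        }
        this.k = k;
    }

    /** Splits one 64-byte digest into k indexes, each in 0 .. 65535. */
    private int[] indexes(String value) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-512")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            int hi = digest[2 * i] & 0xFF;
            int lo = digest[2 * i + 1] & 0xFF;
            result[i] = (hi << 8) | lo;
        }
        return result;
    }

    void add(String value) {
        for (int index : indexes(value)) {
            mask.set(index);
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    boolean mightContain(String value) {
        for (int index : indexes(value)) {
            if (!mask.get(index)) {
                return false;
            }
        }
        return true;
    }
}
```

A fixed 65,536-bit mask only suits small collections; for the 1M-item example on the earlier slides, wider bit fields and a larger mask would be needed.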
  • 26. COUNT-MIN SKETCH When you want to know how many of each item there are in a collection.
  • 27. COUNT-MIN SKETCH [Figure: a grid of counters, w wide and d deep; adding Thing 1 applies +1 to counters in the grid.] Each box is a counter. Each row is indexed by a corresponding hash function.
  • 28. COUNT-MIN SKETCH [Figure: the same w-by-d grid; ‘Some thing’ maps to the counters labelled a, b, c and d.] Estimated frequency for ‘Some thing’ is min(a, b, c, d).
  • 29. COUNT-MIN SKETCH How big to make ‘w’ and ‘d’? ‘w’ is the number of counters per hash function; it limits the magnitude of the error. ‘d’ is the number of separate hash functions; it controls the probability that the estimate is off by more than that error.
  • 30. COUNT-MIN SKETCH error <= 2 * n / w with probability at least 1 - (1 / 2) ** d; that is, the error limit is exceeded with probability at most (1 / 2) ** d. n = total number of items counted; w = number of counters per hash function; d = number of separate hash functions. Works best on skewed data.
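A count-min sketch along the lines described above can be sketched in a few dozen lines. The class layout and method names are hypothetical, and the per-row hash is an assumption: it mixes `hashCode()` with the row number using the splitmix64 finalizer constants, which the talk does not mention. Any family of d reasonably independent hash functions could replace it, and the earlier caution about `hashCode()` across processes applies here too.

```java
// Hypothetical count-min sketch: d rows of w counters, one hash per row.
final class CountMinSketch {
    private final int w;           // counters per row
    private final int d;           // number of rows (hash functions)
    private final long[][] counts;

    CountMinSketch(int w, int d) {
        this.w = w;
        this.d = d;
        this.counts = new long[d][w];
    }

    // Assumed per-row hash: splitmix64-style mixing of hashCode() and the row.
    private int index(Object value, int row) {
        long x = value.hashCode() + 0x9E3779B97F4A7C15L * (row + 1);
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        x ^= x >>> 31;
        return (int) Math.floorMod(x, (long) w);
    }

    /** Increments one counter in every row. */
    void add(Object value) {
        for (int row = 0; row < d; row++) {
            counts[row][index(value, row)]++;
        }
    }

    /** Minimum over the rows; never below the true count. */
    long estimate(Object value) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < d; row++) {
            min = Math.min(min, counts[row][index(value, row)]);
        }
        return min;
    }
}
```

Collisions can only add to a counter, so each row overestimates or is exact; taking the minimum across rows picks the least contaminated counter, which is why the estimate is an upper bound on the true frequency.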
  • 31. RESOURCES
      https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
      http://blog.aggregateknowledge.com/
      http://lkozma.net/blog/sketching-data-structures/
      https://sites.google.com/site/countminsketch/home
      “PyCon 2011: Handling ridiculous amounts of data with probabilistic data structures”