SlideShare ist ein Scribd-Unternehmen logo
1 von 32
Introduction to Counting
                                            APAM E4990
                                      Computational Social Science


                                             Jake Hofman

                                            Columbia University


                                            February 1, 2013




Jake Hofman   (Columbia University)            Intro to Counting     February 1, 2013   1 / 21
Why counting?




Jake Hofman   (Columbia University)   Intro to Counting   February 1, 2013   2 / 21
Why counting?




                                      sta·tis·tic

       noun
       1. A function of a random sample of data
       2. A functional of the distribution over a random variable




Jake Hofman   (Columbia University)   Intro to Counting        February 1, 2013   3 / 21
Why counting?




                                       Problem:

        Traditionally difficult to estimate distributions due to small sample
                                  sizes or sparsity




Jake Hofman   (Columbia University)   Intro to Counting         February 1, 2013   4 / 21
Why counting?




                                             Potential solution:

                                      Sacrifice granularity for precision

                            (e.g., bin observations into larger groups)




Jake Hofman   (Columbia University)              Intro to Counting         February 1, 2013   4 / 21
Why counting?




                                       Potential solution:

              Develop sophisticated methods that generalize well from small
                                        samples

                       (e.g., model distributions with parametric forms)




Jake Hofman    (Columbia University)       Intro to Counting          February 1, 2013   4 / 21
Why counting?




Jake Hofman   (Columbia University)   Intro to Counting   February 1, 2013   5 / 21
Why counting?




                                      (Partial) solution:

              When observations are plentiful, simply count and divide to
                   estimate distributions with relative frequencies




Jake Hofman   (Columbia University)       Intro to Counting       February 1, 2013   6 / 21
Why counting?




                                      The good:

        Shift away from sophisticated statistical methods on small samples
                       to simple methods on large samples




Jake Hofman   (Columbia University)   Intro to Counting        February 1, 2013   7 / 21
Why counting?




                                      The bad:

              Even simple methods (e.g., counting) are computationally
                            challenging at large scales




Jake Hofman   (Columbia University)   Intro to Counting         February 1, 2013   7 / 21
Why counting?




                                         Claim:

         Solving the counting problem at scale enables you to investigate
                 many interesting questions in the social sciences




Jake Hofman   (Columbia University)   Intro to Counting        February 1, 2013   7 / 21
Why counting?




Jake Hofman   (Columbia University)   Intro to Counting   February 1, 2013   8 / 21
Learning to count




                                      This week:

                  Counting at small/medium scales on a single machine




Jake Hofman   (Columbia University)   Intro to Counting          February 1, 2013   9 / 21
Learning to count




                                              This week:

                  Counting at small/medium scales on a single machine


                                           Following weeks:

                                  Counting at large scales in parallel




Jake Hofman   (Columbia University)           Intro to Counting          February 1, 2013   9 / 21
Counting, the easy way




              • Load dataset into memory
              • Arrange observations into groups of interest
              • Compute distributions and statistics within each group




Jake Hofman    (Columbia University)   Intro to Counting          February 1, 2013   10 / 21
Example: Anatomy of the long tail




              Dataset            Users   Items     Rating levels   Observations
              Movielens          100K     10K           10            10M
               Netflix            500K     20K            5            100M




Jake Hofman   (Columbia University)         Intro to Counting           February 1, 2013   11 / 21
Example: Anatomy of the long tail




              Dataset            Users   Items     Rating levels   Observations
              Movielens          100K     10K           10            10M
               Netflix            500K     20K            5            100M




Jake Hofman   (Columbia University)         Intro to Counting           February 1, 2013   11 / 21
Example: Movielens
                                3,000,000




                                2,000,000
                        Count




                                1,000,000




                                       0

                                            1           2           3   4   5
                                                            Rating
Jake Hofman   (Columbia University)             Intro to Counting               February 1, 2013   12 / 21
Example: Movielens



                        Density




                                      1      2            3      4   5
                                          Mean Rating by Movie
Jake Hofman   (Columbia University)          Intro to Counting           February 1, 2013   13 / 21
Example: Movielens

                              100%




                              75%
                        CDF




                              50%




                              25%




                               0%

                                      0   3,000          6,000   9,000
                                                Movie Rank
Jake Hofman   (Columbia University)        Intro to Counting             February 1, 2013   14 / 21
Example: Movielens



                        Density




                                  10   100                  1,000
                                       User eccentricity
Jake Hofman   (Columbia University)     Intro to Counting           February 1, 2013   15 / 21
The group-by operation




        We estimate conditional distributions by grouping observations by
         their group identity, or key, and counting values within groups


                                      p( y | x1 , x2 , x3 , . . .)
                                         value                key




Jake Hofman   (Columbia University)              Intro to Counting   February 1, 2013   16 / 21
The group-by operation

                                      Generic Group-By

       for each observation:
         place observation in bucket for corresponding group

       for each group:
         compute function over observations in bucket
         output group and result




Jake Hofman   (Columbia University)      Intro to Counting   February 1, 2013   17 / 21
The group-by operation

                                       Generic Group-By

       for each observation:
         place observation in bucket for corresponding group

       for each group:
         compute function over observations in bucket
         output group and result


              Useful for computing arbitrary within-group statistics when we
                                   have required memory
                        (e.g., conditional distribution, median, etc.)



Jake Hofman    (Columbia University)      Intro to Counting        February 1, 2013   17 / 21
The group-by operation

                                      Combinable Group-By

       for each observation:
         update result for corresponding group,
         as function of current result and observation

       for each group:
         output group and result




Jake Hofman   (Columbia University)        Intro to Counting   February 1, 2013   18 / 21
The group-by operation

                                       Combinable Group-By

       for each observation:
         update result for corresponding group,
         as function of current result and observation

       for each group:
         output group and result


              Useful for computing a subset of within-group statistics with a
                                  limited memory footprint
                           (e.g., min, mean, max, variance, etc.)



Jake Hofman    (Columbia University)        Intro to Counting      February 1, 2013   18 / 21
Example: Anatomy of the long tail




              Dataset            Users   Items     Rating levels   Observations
              Movielens          100K     10K           10            10M
               Netflix            500K     20K            5            100M


         What do we do when the full dataset exceeds available memory?




Jake Hofman   (Columbia University)         Intro to Counting           February 1, 2013   19 / 21
Example: Anatomy of the long tail



              Dataset            Users   Items      Rating levels   Observations
              Movielens          100K     10K            10            10M
               Netflix            500K     20K             5            100M


         What do we do when the full dataset exceeds available memory?

                                              Sampling?
                                 Unreliable estimates for rare groups




Jake Hofman   (Columbia University)          Intro to Counting           February 1, 2013   19 / 21
Example: Anatomy of the long tail



              Dataset                Users   Items     Rating levels   Observations
              Movielens              100K     10K           10            10M
               Netflix                500K     20K            5            100M


         What do we do when the full dataset exceeds available memory?

                                       Random access from disk?
                                  1000x more storage, but 1000x slower1




              1
                  Numbers every programmer should know
Jake Hofman       (Columbia University)         Intro to Counting           February 1, 2013   19 / 21
Example: Anatomy of the long tail



              Dataset            Users   Items     Rating levels   Observations
              Movielens          100K     10K           10            10M
               Netflix            500K     20K            5            100M


         What do we do when the full dataset exceeds available memory?

                                  Streaming?
          Read data one observation at a time, storing only needed state




Jake Hofman   (Columbia University)         Intro to Counting           February 1, 2013   19 / 21
The group-by operation


                                      For arbitrary input data:

              Memory               Scenario              Distributions    Statistics
                N                Small dataset                Yes          General
               V*G             Small distributions            Yes          General
                G               Small # groups                No         Combinable
                V              Small # outcomes               No             No
                1                Large # both                 No             No


                            N = total number of observations
                             G = number of distinct groups
                    V = largest number of distinct values within group



Jake Hofman   (Columbia University)          Intro to Counting              February 1, 2013   20 / 21
The group-by operation


                                      For pre-grouped input data:

              Memory               Scenario               Distributions    Statistics
                N                Small dataset                 Yes          General
               V*G             Small distributions             Yes          General
                G               Small # groups                 No         Combinable
                V              Small # outcomes                Yes          General
                1                Large # both                  No         Combinable


                            N = total number of observations
                             G = number of distinct groups
                    V = largest number of distinct values within group



Jake Hofman   (Columbia University)           Intro to Counting              February 1, 2013   21 / 21

Weitere ähnliche Inhalte

Andere mochten auch

Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wranglingjakehofman
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIjakehofman
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part Ijakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overviewjakehofman
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1jakehofman
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scalejakehofman
 
Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03jakehofman
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scalejakehofman
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in Rjakehofman
 
TalkPad, una idea innovadora y revolucionadora
TalkPad, una idea innovadora y revolucionadoraTalkPad, una idea innovadora y revolucionadora
TalkPad, una idea innovadora y revolucionadoraguest0fb0ce
 
Porque es tan dificil hacer activos
Porque es tan dificil hacer activosPorque es tan dificil hacer activos
Porque es tan dificil hacer activosFONZI
 
Cerveceria LOS VIKINGOS
Cerveceria LOS VIKINGOSCerveceria LOS VIKINGOS
Cerveceria LOS VIKINGOSjorchuk
 
Smith_David_BB pdf 2.11.16
Smith_David_BB pdf 2.11.16Smith_David_BB pdf 2.11.16
Smith_David_BB pdf 2.11.16David W Smith
 

Andere mochten auch (20)

Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part II
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part I
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
 
Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03Data-driven Modeling: Lecture 03
Data-driven Modeling: Lecture 03
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scale
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in R
 
Usabilidad
UsabilidadUsabilidad
Usabilidad
 
TalkPad, una idea innovadora y revolucionadora
TalkPad, una idea innovadora y revolucionadoraTalkPad, una idea innovadora y revolucionadora
TalkPad, una idea innovadora y revolucionadora
 
практ5
практ5практ5
практ5
 
Sanchez Navas Bloque 5
Sanchez Navas Bloque 5Sanchez Navas Bloque 5
Sanchez Navas Bloque 5
 
ю 597 048
ю 597 048 ю 597 048
ю 597 048
 
Porque es tan dificil hacer activos
Porque es tan dificil hacer activosPorque es tan dificil hacer activos
Porque es tan dificil hacer activos
 
Disco Duro
Disco DuroDisco Duro
Disco Duro
 
Cerveceria LOS VIKINGOS
Cerveceria LOS VIKINGOSCerveceria LOS VIKINGOS
Cerveceria LOS VIKINGOS
 
Smith_David_BB pdf 2.11.16
Smith_David_BB pdf 2.11.16Smith_David_BB pdf 2.11.16
Smith_David_BB pdf 2.11.16
 
практ6
практ6практ6
практ6
 

Ähnlich wie Computational Social Science, Lecture 02: An Introduction to Counting

Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09jakehofman
 
Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02jakehofman
 
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01jakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationjakehofman
 
Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10jakehofman
 

Ähnlich wie Computational Social Science, Lecture 02: An Introduction to Counting (6)

Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09
 
Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02
 
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
 
Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10
 

Mehr von jakehofman

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2jakehofman
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1jakehofman
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networksjakehofman
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classificationjakehofman
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systemsjakehofman
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayesjakehofman
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studiesjakehofman
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Sciencejakehofman
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbitjakehofman
 
Using Data to Understand the Brain
Using Data to Understand the BrainUsing Data to Understand the Brain
Using Data to Understand the Brainjakehofman
 

Mehr von jakehofman (10)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbit
 
Using Data to Understand the Brain
Using Data to Understand the BrainUsing Data to Understand the Brain
Using Data to Understand the Brain
 

Computational Social Science, Lecture 02: An Introduction to Counting

  • 1. Introduction to Counting APAM E4990 Computational Social Science Jake Hofman Columbia University February 1, 2013 Jake Hofman (Columbia University) Intro to Counting February 1, 2013 1 / 21
  • 2. Why counting? Jake Hofman (Columbia University) Intro to Counting February 1, 2013 2 / 21
  • 3. Why counting? sta·tis·tic noun 1. A function of a random sample of data 2. A functional of the distribution over a random variable Jake Hofman (Columbia University) Intro to Counting February 1, 2013 3 / 21
  • 4. Why counting? Problem: Traditionally difficult to estimate distributions due to small sample sizes or sparsity Jake Hofman (Columbia University) Intro to Counting February 1, 2013 4 / 21
  • 5. Why counting? Potential solution: Sacrifice granularity for precision (e.g., bin observations into larger groups) Jake Hofman (Columbia University) Intro to Counting February 1, 2013 4 / 21
  • 6. Why counting? Potential solution: Develop sophisticated methods that generalize well from small samples (e.g., model distributions with parametric forms) Jake Hofman (Columbia University) Intro to Counting February 1, 2013 4 / 21
  • 7. Why counting? Jake Hofman (Columbia University) Intro to Counting February 1, 2013 5 / 21
  • 8. Why counting? (Partial) solution: When observations are plentiful, simply count and divide to estimate distributions with relative frequencies Jake Hofman (Columbia University) Intro to Counting February 1, 2013 6 / 21
  • 9. Why counting? The good: Shift away from sophisticated statistical methods on small samples to simple methods on large samples Jake Hofman (Columbia University) Intro to Counting February 1, 2013 7 / 21
  • 10. Why counting? The bad: Even simple methods (e.g., counting) are computationally challenging at large scales Jake Hofman (Columbia University) Intro to Counting February 1, 2013 7 / 21
  • 11. Why counting? Claim: Solving the counting problem at scale enables you to investigate many interesting questions in the social sciences Jake Hofman (Columbia University) Intro to Counting February 1, 2013 7 / 21
  • 12. Why counting? Jake Hofman (Columbia University) Intro to Counting February 1, 2013 8 / 21
  • 13. Learning to count This week: Counting at small/medium scales on a single machine Jake Hofman (Columbia University) Intro to Counting February 1, 2013 9 / 21
  • 14. Learning to count This week: Counting at small/medium scales on a single machine Following weeks: Counting at large scales in parallel Jake Hofman (Columbia University) Intro to Counting February 1, 2013 9 / 21
  • 15. Counting, the easy way • Load dataset into memory • Arrange observations into groups of interest • Compute distributions and statistics within each group Jake Hofman (Columbia University) Intro to Counting February 1, 2013 10 / 21
  • 16. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M Jake Hofman (Columbia University) Intro to Counting February 1, 2013 11 / 21
  • 17. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M Jake Hofman (Columbia University) Intro to Counting February 1, 2013 11 / 21
  • 18. Example: Movielens 3,000,000 2,000,000 Count 1,000,000 0 1 2 3 4 5 Rating Jake Hofman (Columbia University) Intro to Counting February 1, 2013 12 / 21
  • 19. Example: Movielens Density 1 2 3 4 5 Mean Rating by Movie Jake Hofman (Columbia University) Intro to Counting February 1, 2013 13 / 21
  • 20. Example: Movielens 100% 75% CDF 50% 25% 0% 0 3,000 6,000 9,000 Movie Rank Jake Hofman (Columbia University) Intro to Counting February 1, 2013 14 / 21
  • 21. Example: Movielens Density 10 100 1,000 User eccentricity Jake Hofman (Columbia University) Intro to Counting February 1, 2013 15 / 21
  • 22. The group-by operation We estimate conditional distributions by grouping observations by their group identity, or key, and counting values within groups p( y | x1 , x2 , x3 , . . .) value key Jake Hofman (Columbia University) Intro to Counting February 1, 2013 16 / 21
  • 23. The group-by operation Generic Group-By for each observation: place observation in bucket for corresponding group for each group: compute function over observations in bucket output group and result Jake Hofman (Columbia University) Intro to Counting February 1, 2013 17 / 21
  • 24. The group-by operation Generic Group-By for each observation: place observation in bucket for corresponding group for each group: compute function over observations in bucket output group and result Useful for computing arbitrary within-group statistics when we have required memory (e.g., conditional distribution, median, etc.) Jake Hofman (Columbia University) Intro to Counting February 1, 2013 17 / 21
  • 25. The group-by operation Combinable Group-By for each observation: update result for corresponding group, as function of current result and observation for each group: output group and result Jake Hofman (Columbia University) Intro to Counting February 1, 2013 18 / 21
  • 26. The group-by operation Combinable Group-By for each observation: update result for corresponding group, as function of current result and observation for each group: output group and result Useful for computing a subset of within-group statistics with a limited memory footprint (e.g., min, mean, max, variance, etc.) Jake Hofman (Columbia University) Intro to Counting February 1, 2013 18 / 21
  • 27. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Jake Hofman (Columbia University) Intro to Counting February 1, 2013 19 / 21
  • 28. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Sampling? Unreliable estimates for rare groups Jake Hofman (Columbia University) Intro to Counting February 1, 2013 19 / 21
  • 29. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Random access from disk? 1000x more storage, but 1000x slower1 1 Numbers every programmer should know Jake Hofman (Columbia University) Intro to Counting February 1, 2013 19 / 21
  • 30. Example: Anatomy of the long tail Dataset Users Items Rating levels Observations Movielens 100K 10K 10 10M Netflix 500K 20K 5 100M What do we do when the full dataset exceeds available memory? Streaming? Read data one observation at a time, storing only needed state Jake Hofman (Columbia University) Intro to Counting February 1, 2013 19 / 21
  • 31. The group-by operation For arbitrary input data: Memory Scenario Distributions Statistics N Small dataset Yes General V*G Small distributions Yes General G Small # groups No Combinable V Small # outcomes No No 1 Large # both No No N = total number of observations G = number of distinct groups V = largest number of distinct values within group Jake Hofman (Columbia University) Intro to Counting February 1, 2013 20 / 21
  • 32. The group-by operation For pre-grouped input data: Memory Scenario Distributions Statistics N Small dataset Yes General V*G Small distributions Yes General G Small # groups No Combinable V Small # outcomes Yes General 1 Large # both No Combinable N = total number of observations G = number of distinct groups V = largest number of distinct values within group Jake Hofman (Columbia University) Intro to Counting February 1, 2013 21 / 21