Computational Social Science, Lecture 02: An Introduction to Counting
1. Introduction to Counting
APAM E4990
Computational Social Science
Jake Hofman
Columbia University
February 1, 2013
Jake Hofman (Columbia University) Intro to Counting February 1, 2013 1 / 21
3. Why counting?
sta·tis·tic
noun
1. A function of a random sample of data
2. A functional of the distribution over a random variable
4. Why counting?
Problem:
Traditionally difficult to estimate distributions due to small sample
sizes or sparsity
5. Why counting?
Potential solution:
Sacrifice granularity for precision
(e.g., bin observations into larger groups)
6. Why counting?
Potential solution:
Develop sophisticated methods that generalize well from small
samples
(e.g., model distributions with parametric forms)
8. Why counting?
(Partial) solution:
When observations are plentiful, simply count and divide to
estimate distributions with relative frequencies
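As a minimal sketch of "count and divide", the snippet below estimates a distribution from relative frequencies. The function name and the sample ratings are made up for illustration.

```python
from collections import Counter

def relative_frequencies(observations):
    """Estimate a distribution by counting each value and dividing by the total."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical ratings on a 1-5 scale, invented for this example.
ratings = [5, 4, 4, 3, 5, 4, 1, 4]
freqs = relative_frequencies(ratings)
print(freqs[4])  # 0.5, since 4 of the 8 ratings are a 4
```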
9. Why counting?
The good:
Shift away from sophisticated statistical methods on small samples
to simple methods on large samples
10. Why counting?
The bad:
Even simple methods (e.g., counting) are computationally
challenging at large scales
11. Why counting?
Claim:
Solving the counting problem at scale enables you to investigate
many interesting questions in the social sciences
14. Learning to count
This week:
Counting at small/medium scales on a single machine
Following weeks:
Counting at large scales in parallel
15. Counting, the easy way
• Load dataset into memory
• Arrange observations into groups of interest
• Compute distributions and statistics within each group
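The three steps above could look roughly like this on a single machine. This is only a sketch: the inline CSV text stands in for a real ratings file, and the column names are assumptions.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical stand-in for a ratings file; the column names are assumptions.
RAW = """user_id,movie_id,rating
u1,m1,5
u2,m1,3
u1,m2,4
u3,m2,4
u3,m3,1
"""

# 1. Load the dataset into memory.
rows = list(csv.DictReader(io.StringIO(RAW)))

# 2. Arrange observations into groups of interest (here, by movie).
ratings_by_movie = defaultdict(list)
for row in rows:
    ratings_by_movie[row["movie_id"]].append(int(row["rating"]))

# 3. Compute a statistic within each group (here, the mean rating).
mean_rating = {movie: mean(vals) for movie, vals in ratings_by_movie.items()}
print(mean_rating)
```

This works only while the whole dataset fits in memory, which is the "small/medium scale" case.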
16. Example: Anatomy of the long tail
Dataset     Users   Items   Rating levels   Observations
Movielens   100K    10K     10              10M
Netflix     500K    20K     5               100M
19. Example: Movielens
[Figure: density of mean rating by movie; x-axis: mean rating by movie (1 to 5), y-axis: density]
20. Example: Movielens
[Figure: CDF (0% to 100%) versus movie rank (0 to 9,000)]
21. Example: Movielens
[Figure: density of user eccentricity; x-axis: user eccentricity (ticks at 10, 100, 1,000), y-axis: density]
22. The group-by operation
We estimate conditional distributions by grouping observations by
their group identity, or key, and counting values within groups
p(y | x_1, x_2, x_3, ...)
where y is the value and (x_1, x_2, x_3, ...) is the key
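One way to read this: for each key, count how often each value occurs, then divide by the number of observations with that key. The sketch below does that for (key, value) pairs. The keys (genre, decade) are hypothetical and chosen only to show a multi-part key.

```python
from collections import Counter, defaultdict

def conditional_distribution(observations):
    """Estimate p(value | key) from (key, value) pairs by counting within groups."""
    counts = defaultdict(Counter)
    for key, value in observations:
        counts[key][value] += 1
    return {
        key: {value: n / sum(c.values()) for value, n in c.items()}
        for key, c in counts.items()
    }

# Hypothetical observations: key = (genre, decade), value = rating.
obs = [
    (("drama", 1990), 5),
    (("drama", 1990), 4),
    (("drama", 1990), 5),
    (("comedy", 2000), 3),
]
p = conditional_distribution(obs)
print(p[("drama", 1990)])
```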
24. The group-by operation
Generic Group-By
for each observation:
    place observation in bucket for corresponding group
for each group:
    compute function over observations in bucket
    output group and result
Useful for computing arbitrary within-group statistics when we
have enough memory to hold each group's observations
(e.g., conditional distributions or medians)
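A possible Python rendering of the generic group-by is sketched below. The function and parameter names are illustrative, not taken from the lecture, and the median is used as an example of a statistic that needs the whole bucket.

```python
from collections import defaultdict
from statistics import median

def generic_group_by(observations, key_fn, value_fn, stat_fn):
    """Bucket every observation by key, then apply stat_fn to each full bucket."""
    buckets = defaultdict(list)
    for obs in observations:
        buckets[key_fn(obs)].append(value_fn(obs))
    # Every value is kept, so memory grows with the number of observations.
    return {group: stat_fn(values) for group, values in buckets.items()}

# Hypothetical (movie, rating) pairs.
data = [("m1", 5), ("m1", 1), ("m1", 4), ("m2", 3), ("m2", 2)]
medians = generic_group_by(data, lambda o: o[0], lambda o: o[1], median)
print(medians)
```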
26. The group-by operation
Combinable Group-By
for each observation:
    update result for corresponding group,
    as function of current result and observation
for each group:
    output group and result
Useful for computing the subset of within-group statistics that
can be updated one observation at a time, with a limited memory footprint
(e.g., min, mean, max, variance)
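The sketch below shows one way to keep a constant-size result per group. It uses Welford's online update for the mean and variance, which is a choice made here for illustration rather than something the slides specify. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RunningStats:
    """Constant-size state per group: count, min, max, mean, and sum of squared deviations."""
    n: int = 0
    lo: float = float("inf")
    hi: float = float("-inf")
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x):
        # Welford's online update: new result from current result and one observation.
        self.n += 1
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; not defined for an empty group.
        return self.m2 / self.n if self.n else float("nan")

def combinable_group_by(observations):
    results = {}
    for key, value in observations:
        results.setdefault(key, RunningStats()).update(value)
    return results

# Hypothetical (movie, rating) pairs.
data = [("m1", 5), ("m1", 1), ("m1", 3), ("m2", 4)]
stats = combinable_group_by(data)
print(stats["m1"].mean, stats["m1"].variance)
```

Memory here scales with the number of groups, not with the number of observations. A median cannot be maintained this way, which is why it belongs to the generic case.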
27. Example: Anatomy of the long tail
Dataset     Users   Items   Rating levels   Observations
Movielens   100K    10K     10              10M
Netflix     500K    20K     5               100M
What do we do when the full dataset exceeds available memory?
Sampling?
Unreliable estimates for rare groups
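To see why rare groups suffer, consider a simplified model in which each observation is kept independently with a fixed probability. Under that assumption, a group with n observations is missing from the sample entirely with probability (1 - rate)^n. The sampling rate and group sizes below are hypothetical.

```python
def prob_group_missing(group_size, sampling_rate):
    """Probability that none of a group's observations land in the sample,
    assuming each observation is kept independently with probability sampling_rate."""
    return (1.0 - sampling_rate) ** group_size

# Hypothetical: a 1% sample, and movies with 5 versus 1,000 ratings.
rare = prob_group_missing(5, 0.01)
popular = prob_group_missing(1000, 0.01)
print(f"rare movie missing:    {rare:.3f}")
print(f"popular movie missing: {popular:.6f}")
```

Under these assumptions the rare movie is absent from roughly 95% of samples, while the popular one is almost always present.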
Random access from disk?
1000x more storage, but 1000x slower [1]
[1] Numbers every programmer should know
Streaming?
Read data one observation at a time, storing only needed state
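A minimal streaming sketch, assuming comma-separated lines of the form user,movie,rating (a hypothetical layout): the loop reads one line at a time and keeps only a counter per movie.

```python
import io
from collections import Counter

def stream_counts(lines):
    """Count ratings per movie, reading one line at a time.

    Only the per-movie counters are kept in memory, never the full dataset.
    """
    counts = Counter()
    for line in lines:  # a file object yields its lines lazily
        line = line.strip()
        if not line:
            continue
        user_id, movie_id, rating = line.split(",")
        counts[movie_id] += 1
    return counts

# Hypothetical input; in practice `lines` could be an open file handle
# such as open("ratings.csv"), where the filename is made up.
fake_file = io.StringIO("u1,m1,5\nu2,m1,3\nu1,m2,4\n")
counts = stream_counts(fake_file)
print(counts)
```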
31. The group-by operation
For arbitrary input data:
Memory   Scenario              Distributions   Statistics
N        Small dataset         Yes             General
V*G      Small distributions   Yes             General
G        Small # groups        No              Combinable
V        Small # outcomes      No              No
1        Large # both          No              No
N = total number of observations
G = number of distinct groups
V = largest number of distinct values within group
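To make the memory column concrete, the sketch below plugs in the approximate sizes from the dataset table in the long-tail example. It assumes ratings are grouped by movie, so G is the number of items and V is the number of rating levels. That grouping is an illustrative choice, and the results count stored items, not bytes.

```python
# Approximate sizes from the dataset table; grouping by movie is an assumption.
datasets = {
    "Movielens": {"N": 10_000_000, "G": 10_000, "V": 10},
    "Netflix": {"N": 100_000_000, "G": 20_000, "V": 5},
}

def state_sizes(N, G, V):
    """Number of stored items (not bytes) under each memory scenario."""
    return {"N": N, "V*G": V * G, "G": G, "V": V, "1": 1}

for name, d in datasets.items():
    print(name, state_sizes(**d))
```

For both datasets, V*G is on the order of 100K counters, several orders of magnitude smaller than N.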
32. The group-by operation
For pre-grouped input data:
Memory   Scenario              Distributions   Statistics
N        Small dataset         Yes             General
V*G      Small distributions   Yes             General
G        Small # groups        No              Combinable
V        Small # outcomes      Yes             General
1        Large # both          No              Combinable
N = total number of observations
G = number of distinct groups
V = largest number of distinct values within group
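When the input is already grouped by key, each group can be processed and discarded before the next one starts. The sketch below uses `itertools.groupby` for this, so only one group's counts are in memory at a time (the V row of the table). The function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def pregrouped_distributions(observations):
    """Yield (key, relative frequencies) for input already grouped by key.

    Only one group's counts are held at a time, so memory scales with V, not V*G.
    """
    for key, group in groupby(observations, key=itemgetter(0)):
        counts = Counter(value for _, value in group)
        total = sum(counts.values())
        yield key, {value: n / total for value, n in counts.items()}

# Hypothetical (movie, rating) pairs, already grouped by movie.
data = [("m1", 5), ("m1", 5), ("m1", 3), ("m2", 4)]
# dict() collects all results only for display; a real pipeline would
# write each group out as it is produced.
result = dict(pregrouped_distributions(data))
print(result)
```

Note that `groupby` only merges consecutive items with equal keys, so this relies on the pre-grouped assumption. On unsorted input the same key would appear more than once.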