1. Learning from Web Activity
Jake Hofman
Yahoo! Research
November 18, 2010
Jake Hofman (@jakehofman) Learning from Web Activity TimesOpen, 2010.11.18 1 / 33
2. Outline
1 Agenda: Just enough philosophy
2 Case study: Demographic diversity on the Web
3 Conclusion: Lessons learned
3. Agenda
Size (only kind of) matters
Big Data
4. Agenda
Size (only kind of) matters
Big Data
Lots of data means lots to learn (from)
5. Agenda
Size (only kind of) matters
Big Data
But the “big” part isn’t intrinsically interesting
(although large sample sizes are always good)
6. Agenda
Size (only kind of) matters
Big Data
Regardless of size, it’s really about “data jeopardy”
(To what question are these data the answer?)
7. Agenda
Tools
Data tools:
• Shell scripting & Python
Munging, Glue
• R
Modeling, Visualization
8. Agenda
Tools
Big Data tools:
• Hadoop & Pig
Filtering, Aggregating
• Shell scripting & Python
Munging, Glue
• R
Modeling, Visualization
9. Agenda
The clean real story
“We have a habit in writing articles published in
scientific journals to make the work as finished as
possible, to cover all the tracks, to not worry about the
blind alleys or to describe how you had the wrong idea
first, and so on. So there isn’t any place to publish, in
a dignified manner, what you actually did in order to
get to do the work ...”
-Richard Feynman
Nobel Lecture [1], 1965
[1] http://bit.ly/feynmannobel
10. Outline
1 Agenda: Just enough philosophy
2 Case study: Demographic diversity on the Web
3 Conclusion: Lessons learned
11. Demographic diversity on the Web
The clean story
(covering our tracks)
12. Demographic diversity on the Web
with Irmak Sirer and Sharad Goel
How diverse is the Web?
To what extent do online experiences vary across demographic
groups?
13. Diversity of the Web
Data
• Representative sample of 265,000 individuals in the US, paid
via the Nielsen MegaPanel [2]
• Log of anonymized, complete browsing activity from June
2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information
(age, education, income, race, sex, etc.)
[2] http://bit.ly/nielsenonline
14. Diversity of the Web
Data
• Transform all demographic attributes to binary variables
e.g., Age → Over/Under 25, Race → White/Non-White,
Sex → Female/Male
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com → yahoo.com,
us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k most popular sites
• Aggregate activity at the site, group, and user levels
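The domain normalization step above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used (which the slides attribute to Pig); the function name `normalize_domain` and its `max_levels` parameter are hypothetical, and it ignores multi-part public suffixes such as co.uk.

```python
def normalize_domain(url, max_levels=3):
    """Reduce a URL to at most `max_levels` domain labels, without a leading www."""
    # Drop an optional scheme such as "http://".
    host = url.split("://", 1)[-1]
    # Keep only the host part: cut at the first path, query, or fragment marker.
    for sep in ("/", "?", "#"):
        host = host.split(sep, 1)[0]
    # Drop an optional port and normalize case.
    host = host.split(":", 1)[0].lower()
    labels = [label for label in host.split(".") if label]
    # Strip a leading "www" label, as in www.yahoo.com -> yahoo.com.
    if len(labels) > 2 and labels[0] == "www":
        labels = labels[1:]
    # Keep only the rightmost `max_levels` labels.
    return ".".join(labels[-max_levels:])


# The two examples from the slide.
print(normalize_domain("www.yahoo.com"))
print(normalize_domain("us.mg2.mail.yahoo.com/neo/launch"))
```

Keeping the rightmost three labels is what separates mail.yahoo.com from yahoo.com while still collapsing per-server hostnames such as us.mg2.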
15. Diversity of the Web
Pig to the rescue
16. Diversity of the Web
Site-level skew
How diverse are site audiences?
• For each site and attribute,
calculate the skew in visitors
(e.g., 93% of pageviews on
foxnews.com are by White
users)
• For each attribute, plot the
distribution of visitor skew
across all sites
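As a rough sketch of the per-site skew calculation described above: for one binary attribute, the skew of a site is taken here to be the fraction of its pageviews that come from users with that attribute. The record layout, the function name `site_skew`, and the example sites are assumptions made for illustration; the counts are toy values, not figures from the study.

```python
from collections import defaultdict


def site_skew(pageviews, attribute):
    """Return {site: fraction of pageviews from users whose `attribute` flag is True}."""
    in_group = defaultdict(int)
    total = defaultdict(int)
    for site, user_attrs, count in pageviews:
        total[site] += count
        if user_attrs[attribute]:
            in_group[site] += count
    return {site: in_group[site] / total[site] for site in total}


# Hypothetical toy records: (site, demographic flags of the viewing user, pageviews).
toy_pageviews = [
    ("example-news.com", {"white": True}, 93),
    ("example-news.com", {"white": False}, 7),
    ("example-games.com", {"white": True}, 20),
    ("example-games.com", {"white": False}, 30),
]
skew = site_skew(toy_pageviews, "white")
print(skew)
```

Plotting a histogram or density of the values in `skew` across all sites would give the kind of distribution the slide describes.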
[Figure: density of the proportion of White visitors across sites; x-axis from 0.0 to 1.0]
17. Diversity of the Web
Site-level skew
[Figure: density of visitor skew across sites, one panel per attribute: proportion female visitors, proportion White visitors, proportion college educated visitors, proportion adult visitors, and proportion of visitors with household incomes under $50,000; x-axis from 0.0 to 1.0]
18. Diversity of the Web
Site-level skew
Many sites have skew close to the average, but there are also popular,
highly skewed sites
                                Greater Than 90%                      Less Than 10%
Female                          youravon.com, collectionsetc.com      coveritlive.com, needlive.com
White                           foxnews.com, wunderground.com         blackplanet.com, mediatakeout.com
College Educated                news.google.com, nytimes.com          slumz.boxden.com, sythe.com
Over 25 Years Old               mail.yahoo.com, apps.facebook.com     nanowrimo.org, cbox.ws
Household Income Under $50,000  scarleteen.com, boards.adultswim.com  opentable.com, marketwatch.com
Table 1: A selection of popular sites that are homogeneous along various demographic dimensions.
... visually apparent from Figure 5, there are significant differences in how groups distribute their time on the web. These differences, which, as mentioned above, hold for highly frequented sites such as Facebook and YouTube, are in some cases even more pronounced for lower traffic sites. For instance, the gaming site pogo.com accounts for less than 1% of pageviews among both low and high income users, but low income users spend almost twice as much of their time there.
19. Diversity of the Web
Site-level skew
This skew persists even when we restrict attention to the top 10k
or 1k sites
20. Diversity of the Web
Sites vs. ZIPs
How does the diversity of the online and offline worlds compare?
[Figure: densities for sites vs. ZIPs, one panel each for proportion female, proportion White, and proportion college educated; x-axis from 0.0 to 1.0]
21. Diversity of the Web
Sites vs. ZIPs
How does the diversity of the online and offline worlds compare?
[Figure: densities for sites vs. ZIPs, one panel each for proportion female, proportion White, and proportion college educated; x-axis from 0.0 to 1.0]
As expected, neighborhoods are more gender-balanced than sites
22. Diversity of the Web
Sites vs. ZIPs
How does the diversity of the online and offline worlds compare?
[Figure: densities for sites vs. ZIPs, one panel each for proportion female, proportion White, and proportion college educated; x-axis from 0.0 to 1.0]
But sites typically have more racially diverse audiences than
neighborhoods have residents
23. Diversity of the Web
Sites vs. ZIPs
How does the diversity of the online and offline worlds compare?
[Figure: densities for sites vs. ZIPs, one panel each for proportion female, proportion White, and proportion college educated; x-axis from 0.0 to 1.0]
Skew by education is comparable, with online showing a bias
towards higher education
24. Diversity of the Web
Group-level activity
How does browsing activity vary at the group level?
[Figure: daily per-capita pageviews (y-axis, 0 to 70) by race (Non-White, White), education (No College, College), sex (Male, Female), and age (Under 25, Over 25)]
25. Diversity of the Web
Group-level activity
How does browsing activity vary at the group level?
[Figure: daily per-capita pageviews (y-axis, 0 to 70) by race (Non-White, White), education (No College, College), sex (Male, Female), and age (Under 25, Over 25)]
Large differences exist even at the aggregate level
(e.g. women on average generate 40% more pageviews than men)
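A group-level comparison like the one above comes down to averaging pageviews per user per day within each group and then taking a relative difference. The sketch below assumes a simple list of (group, total pageviews) pairs per user; the function names and all numbers are hypothetical toy values chosen to mirror the 40% example, not data from the study.

```python
def daily_per_capita_pageviews(user_totals, days):
    """user_totals: list of (group, total pageviews for one user over `days` days)."""
    views = {}
    users = {}
    for group, total in user_totals:
        views[group] = views.get(group, 0) + total
        users[group] = users.get(group, 0) + 1
    # Average over users in the group, then over days.
    return {g: views[g] / users[g] / days for g in views}


def percent_more(a, b):
    """How much larger a is than b, as a percentage of b."""
    return 100.0 * (a - b) / b


# Hypothetical toy data: four users observed over 10 days.
toy_totals = [("female", 700), ("female", 700), ("male", 500), ("male", 500)]
rates = daily_per_capita_pageviews(toy_totals, days=10)
print(rates, percent_more(rates["female"], rates["male"]))
```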
26. Diversity of the Web
Group-level activity
All groups spend more than a third of their time on a handful of
email, search, and social networking sites
[Figure: percent of total time spent on site (log scale, 0.1% to 10%) for female and male users, by site]
Sites shown: facebook.com, mail.yahoo.com, google.com, apps.facebook.com, mail.google.com, mail.live.com, youtube.com, webmail.aol.com, mwfb.zynga.com, channel.facebook.com, viewmorepics.myspace.com, search.yahoo.com, myspace.com, msn.com, amazon.com, shop.ebay.com, yahoo.com, images.google.com, home.myspace.com, mail.comcast.net, bing.com, www.yahoo.com, cgi.ebay.com, espn.go.com, messaging.myspace.com, twitter.com, cim.meebo.com, my.ebay.com, en.wikipedia.org, login.yahoo.com, facebook.mafiawars.com, my.yahoo.com, game3.pogo.com, friends.myspace.com, tagged.com, worldwinner.com, meebo.com, login.live.com, mypoints.com, maps.google.com, aol.com, pogo.com, mwms.zynga.com, news.yahoo.com, winster.com, netflix.com, fantasysports.yahoo.com, search.aol.com, comcast.net, alotmetrics.com
27. Diversity of the Web
Group-level activity
But different groups distribute their time differently, both on
universally popular and on more niche sites
[Figure: percent of total time spent on site (log scale, 0.1% to 10%) for female and male users, over the same set of popular sites from facebook.com through alotmetrics.com]
28. Diversity of the Web
Group-level activity
But different groups distribute their time differently, both on
universally popular and on more niche sites
[Figure: percent of total time spent on site (log scale, 0.1% to 10%) for White and non-White users, over the same set of popular sites from facebook.com through alotmetrics.com]
30. Diversity of the Web
Individual-level prediction
How well can one predict an individual’s demographics from their
browsing activity?
• Represent each user by the set of sites visited
• Fit linear models to predict majority/minority for each
attribute on 80% of users
• Tune model parameters using a 10% validation set
• Evaluate final performance on held-out 10% test set
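The evaluation protocol above can be sketched in plain Python as follows. The slides say only "linear models", so logistic regression trained by stochastic gradient descent is used here as a stand-in, with an L2 penalty as the tuned parameter; both choices, all function names, and the toy users and sites are assumptions for illustration.

```python
import math
import random


def split_users(user_ids, seed=0):
    """Shuffle users and split them 80% / 10% / 10% into train, validation, test."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10
    n_val = len(ids) // 10
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def score(weights, bias, sites):
    """Linear score for a user represented as the set of sites visited."""
    return bias + sum(weights.get(s, 0.0) for s in sites)


def fit_logistic(users, labels, l2=0.0, epochs=50, lr=0.1):
    """Fit logistic regression by SGD. users: {id: set of sites}; labels: {id: 0 or 1}."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for uid, sites in users.items():
            p = 1.0 / (1.0 + math.exp(-score(weights, bias, sites)))
            grad = p - labels[uid]  # gradient of the log loss w.r.t. the score
            bias -= lr * grad
            for s in sites:
                w = weights.get(s, 0.0)
                weights[s] = w - lr * (grad + l2 * w)
    return weights, bias


def accuracy(weights, bias, users, labels):
    """Fraction of users whose predicted class (score > 0) matches the label."""
    hits = sum((score(weights, bias, sites) > 0) == (labels[uid] == 1)
               for uid, sites in users.items())
    return hits / len(users)


# Hypothetical toy data: label-1 users visit a.example, label-0 users visit b.example.
labels = {i: i % 2 for i in range(100)}
users = {i: {"shared.example", "a.example" if labels[i] else "b.example"}
         for i in range(100)}

train_ids, val_ids, test_ids = split_users(users)


def subset(ids):
    return {u: users[u] for u in ids}


# Tune the L2 penalty on the validation set, then evaluate once on the test set.
best = None
for l2 in (0.0, 0.01, 0.1):
    w, b = fit_logistic(subset(train_ids), labels, l2=l2)
    val_acc = accuracy(w, b, subset(val_ids), labels)
    if best is None or val_acc > best[0]:
        best = (val_acc, l2, w, b)

_, best_l2, w, b = best
test_acc = accuracy(w, b, subset(test_ids), labels)
print(best_l2, test_acc)
```

The test set is touched only once, after the penalty has been chosen, so the reported accuracy is not biased by the tuning step.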
31. Diversity of the Web
GNU-fu
32. Diversity of the Web
Individual-level prediction
• Reasonable (∼70-85%)
accuracy and AUC across all
attributes
• Similar performance even
when restricted to top 1k
sites
• Can achieve substantially
better performance when
restricted to “stereotypical”
users (∼80-90%)
[Figure: AUC and accuracy (x-axis from .5 to 1) for each classification task: College/No College, Under/Over $50,000 Household Income, White/Non-White, Female/Male, and Over/Under 25 Years Old]
33. Diversity of the Web
Individual-level prediction
Highly-weighted sites under the fitted models
                                Large positive weight         Large negative weight
Female                          winster.com, lancome-usa.com  sports.yahoo.com, espn.go.com
White                           marlboro.com, cmt.com         mediatakeout.com, bet.com
College Educated                news.yahoo.com, linkedin.com  youtube.com, myspace.com
Over 25 Years Old               evite.com, classmates.com     addictinggames.com, youtube.com
Household Income Under $50,000  eharmony.com, tracfone.com    rownine.com, matrixdirect.com
Table 2: A selection of the most predictive (i.e., most highly weighted) sites for each classification task.
... Figure 7, a measure that effectively re-normalizes the majority and minority classes to have equal size. Intuitively, AUC is the probability that a model scores a randomly selected positive example higher than a randomly selected negative one (e.g., the probability that the model correctly distinguishes between a randomly selected female and male). Though an uninformative rule would correctly discriminate ...
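The pairwise reading of AUC given above translates directly into code: compare every positive example with every negative one and count how often the positive is scored higher. The sketch below is a small quadratic-time illustration; the function name `auc` is hypothetical, and counting ties as half a win is a common convention assumed here, not something stated in the slides.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive is scored higher.

    Ties count as half a win (an assumed convention).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical model scores for two positive and two negative examples.
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

Because the measure depends only on pairs drawn one from each class, it is unaffected by how unbalanced the majority and minority classes are.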
34. Diversity of the Web
Individual-level prediction
Proof of concept browser demo
36. Diversity of the Web
The real story
(what we actually did)
37. Diversity on the Web
The real story
• Got several hundred GBs of MegaPanel data from Nielsen [3]
[3] Special thanks to Mainak Mazumdar
38. Diversity on the Web
The real story
• Got several hundred GBs of MegaPanel data from Nielsen [3]
• Discussed possible projects
• Predict user demographics (e.g. real-valued age) from a few
minutes of browsing activity for ad-targeting?
• Infer the number of individuals using the same browser or
behind the same IP?
• Determine number of actual uniques advertisers are receiving?
• . . .
[3] Special thanks to Mainak Mazumdar
39. Diversity on the Web
The real story (cont’d)
• Started with predicting real-valued age
• Worked on this for an embarrassingly long time
(various methods, feature selection, etc.)
• Turns out to be difficult to do better than within 10 years of
true age, on average
• Settled for classification on binary outcomes (e.g.,
adult/non-adult) over entire history
• Classification worked reasonably well for age and other
attributes
40. Diversity on the Web
The real story (cont’d)
• Became curious about why classification worked well
compared to regression
• Generated descriptive statistics across all attributes at the site
and group levels
• Compared site statistics to ZIP code data from the US Census
• Compared time distribution across groups
• Realized that we now had the largest comprehensive study of
demographic diversity on the web