TeamStation AI System Report LATAM IT Salaries 2024
Learning from Web Activity
1. Learning from Web Activity
Jake Hofman
Joint work with Irmak Sirer & Sharad Goel
Yahoo! Research
August 1, 2011
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 1 / 42
2. Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 2 / 42
3. “... applies social sciences techniques, including
algorithmic game theory, economics, network analysis,
psychology, ethnography, and mechanism design, to
online situations.”
http://research.yahoo.com
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 3 / 42
4. possible to estimate accurately the age of ori- biodiversification event in the Paleogene (65 to 8. M. Foote, in Evolutionary Patterns, J. B. C. Jackson et al.,
Eds. (Univ. of Chicago Press, Chicago, IL, 2001), vol. 245,
gin of almost all extant genera. It is then possi- 23 million years ago) that is perhaps not yet pp. 245–295.
ble to plot a backward survivorship curve (8) captured in Alroy et al.’s database (5, 7). The 9. M. D. Spalding et al., Bioscience 57, 573 (2007).
for each of the 27 global bivalve provinces (9). jury is still out on what may have caused this 10. S. M. Stanley, Paleobiology 33, 1 (2007).
11. M. J. Benton, B. C. Emerson, Palaeontology 50, 23 (2007).
On the basis of these curves, Krug et al. find event. But we should not lose sight of the fact
that origination rates of marine bivalves in- that the steep rise to prominence of many mod- 10.1126/science.1169410
SOCIAL SCIENCE
A field is emerging that leverages the
capacity to collect and analyze data at a
Computational Social Science scale that may reveal patterns of individual
and group behaviors.
David Lazer,1 Alex Pentland,2 Lada Adamic,3 Sinan Aral,2,4 Albert-László Barabási,5
Devon Brewer,6 Nicholas Christakis,1 Noshir Contractor,7 James Fowler,8 Myron Gutmann,3
Tony Jebara,9 Gary King,1 Michael Macy,10 Deb Roy,2 Marshall Van Alstyne2,11
W“... a computational social science is emerging that
e live life in the network. We check ment agencies such as the U.S. National Secur- critiqued or replicated. Neither scenario will
our e-mails regularly, make mobile ity Agency. Computational social science could serve the long-term public interest of accumu-
phone calls from almost any loca- become the exclusive domain of private com- lating, verifying, and disseminating knowledge.
tion, swipe transit cards to use public trans- panies and government agencies. Alternatively, What value might a computational social
leverages the capacity to collect and analyze data with an
portation, and make purchases with credit
cards. Our movements in public places may be
there might emerge a privileged set of aca-
demic researchers presiding over private data
science—based in an open academic environ-
ment—offer society, by enhancing understand-
unprecedented breadth and depth and scale ...”
captured by video cameras, and our medical
records stored as digital files. We may post blog
from which they produce papers that cannot be ing of individuals and collectives? What are the
entries accessible to anyone, or maintain friend-
ships through online social networks. Each of
these transactions leaves digital traces that can
http://sciencemag.org/content/323/5915/721
be compiled into comprehensive pictures of
both individual and group behavior, with the
potential to transform our understanding of our
lives, organizations, and societies.
The capacity to collect and analyze massive
amounts of data has transformed such fields as
biology and physics. But the emergence of a
data-driven “computational social science” has
been much slower. Leading journals in eco-
nomics, sociology, and political science show
little evidence of this field. But computational
social science is occurring—in Internet compa-
nies such as Google and Yahoo, and in govern-
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 4 / 42
5. possible to estimate accurately the age of ori- biodiversification event in the Paleogene (65 to 8. M. Foote, in Evolutionary Patterns, J. B. C. Jackson et al.,
Eds. (Univ. of Chicago Press, Chicago, IL, 2001), vol. 245,
gin of almost all extant genera. It is then possi- 23 million years ago) that is perhaps not yet pp. 245–295.
ble to plot a backward survivorship curve (8) captured in Alroy et al.’s database (5, 7). The 9. M. D. Spalding et al., Bioscience 57, 573 (2007).
for each of the 27 global bivalve provinces (9). jury is still out on what may have caused this 10. S. M. Stanley, Paleobiology 33, 1 (2007).
11. M. J. Benton, B. C. Emerson, Palaeontology 50, 23 (2007).
On the basis of these curves, Krug et al. find event. But we should not lose sight of the fact
that origination rates of marine bivalves in- that the steep rise to prominence of many mod- 10.1126/science.1169410
SOCIAL SCIENCE
A field is emerging that leverages the
capacity to collect and analyze data at a
Computational Social Science scale that may reveal patterns of individual
and group behaviors.
David Lazer,1 Alex Pentland,2 Lada Adamic,3 Sinan Aral,2,4 Albert-László Barabási,5
Devon Brewer,6 Nicholas Christakis,1 Noshir Contractor,7 James Fowler,8 Myron Gutmann,3
Tony Jebara,9 Gary King,1 Michael Macy,10 Deb Roy,2 Marshall Van Alstyne2,11
W“... shares with other nascent interdisciplinary fields
e live life in the network. We check ment agencies such as the U.S. National Secur- critiqued or replicated. Neither scenario will
our e-mails regularly, make mobile ity Agency. Computational social science could serve the long-term public interest of accumu-
phone calls from almost any loca- become the exclusive domain of private com- lating, verifying, and disseminating knowledge.
tion, swipe transit cards to use public trans- panies and government agencies. Alternatively, What value might a computational social
(e.g., sustainability science) the need to develop a
portation, and make purchases with credit
cards. Our movements in public places may be
there might emerge a privileged set of aca-
demic researchers presiding over private data
science—based in an open academic environ-
ment—offer society, by enhancing understand-
paradigm for training new scholars ...”
captured by video cameras, and our medical
records stored as digital files. We may post blog
from which they produce papers that cannot be ing of individuals and collectives? What are the
entries accessible to anyone, or maintain friend-
ships through online social networks. Each of
these transactions leaves digital traces that can
http://sciencemag.org/content/323/5915/721
be compiled into comprehensive pictures of
both individual and group behavior, with the
potential to transform our understanding of our
lives, organizations, and societies.
The capacity to collect and analyze massive
amounts of data has transformed such fields as
biology and physics. But the emergence of a
data-driven “computational social science” has
been much slower. Leading journals in eco-
nomics, sociology, and political science show
little evidence of this field. But computational
social science is occurring—in Internet compa-
nies such as Google and Yahoo, and in govern-
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 4 / 42
6. The clean real story
“ We have a habit in writing articles published in
scientific journals to make the work as finished as
possible, to cover all the tracks, to not worry about the
blind alleys or to describe how you had the wrong idea
first, and so on. So there isn’t any place to publish, in
a dignified manner, what you actually did in order to
get to do the work ...”
-Richard Feynman
Nobel Lecture 1 , 1965
1
http://bit.ly/feynmannobel
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 5 / 42
7. Outline
• The clean story
• The real story
• Lessons learned
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 6 / 42
8. Demographic diversity on the Web
with Irmak Sirer and Sharad Goel
The clean story
(covering our tracks)
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 7 / 42
9. Motivation
Previous work is largely survey-based and focuses and group-level
differences in online access
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 8 / 42
10. Motivation
“As of January 1997, we estimate that 5.2 million
African Americans and 40.8 million whites have ever used
the Web, and that 1.4 million African Americans and
20.3 million whites used the Web in the past week.”
-Hoffman & Novak (1998)
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 8 / 42
11. Motivation
Chang, et. al., ICWSM (2010)
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 9 / 42
12. Motivation
Focus on activity instead of access
How diverse is the Web?
To what extent do online experiences vary across demographic
groups?
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 10 / 42
13. Data
• Representative sample of 265,000 individuals in the US, paid
via the Nielsen MegaPanel2
• Log of anonymized, complete browsing activity from June
2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information
(age, education, income, race, sex, etc.)
2
Special thanks to Mainak Mazumdar
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 11 / 42
14. Data
• Transform all demographic attributes to binary variables
e.g., Age → Over/Under 25, Race → White/Non-White,
Sex → Female/Male
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 12 / 42
15. Data
• Transform all demographic attributes to binary variables
e.g., Age → Over/Under 25, Race → White/Non-White,
Sex → Female/Male
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com → yahoo.com,
us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 12 / 42
16. Data
• Transform all demographic attributes to binary variables
e.g., Age → Over/Under 25, Race → White/Non-White,
Sex → Female/Male
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com → yahoo.com,
us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites
(by unique visitors)
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 12 / 42
17. Data
• Transform all demographic attributes to binary variables
e.g., Age → Over/Under 25, Race → White/Non-White,
Sex → Female/Male
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com → yahoo.com,
us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites
(by unique visitors)
• Aggregate activity at the site, group, and user levels
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 12 / 42
18. Site-level skew
How diverse are site audiences?
• For each site and attribute,
calculate the skew in visitors
(e.g., 93% of pageviews on
Density
foxnews.com are by White
users)
• For each attribute, plot the
distribution of visitor skew
across all sites
0.0 0.2 0.4 0.6 0.8 1.0
Proportion White Visitors
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 13 / 42
19. Site-level skew
Density
Density
Density
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Proportion Female Visitors Proportion White Visitors Proportion College Educated Visitors
Density
Density
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0 Proportion of Visitors With
Proportion Adult Visitors Household Incomes Under $50,000
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 14 / 42
20. Site-level skew
Many sites have skew close the overall mean, but there also
popular, highly-skewed sites
Greater Than 90% Less Than 10%
youravon.com coveritlive.com
Female
collectionsetc.com needlive.com
foxnews.com blackplanet.com
White
wunderground.com mediatakeout.com
news.google.com slumz.boxden.com
College Educated
nytimes.com sythe.com
mail.yahoo.com nanowrimo.org
Over 25 Years Old
apps.facebook.com cbox.ws
Household Income scarleteen.com opentable.com
Under $50,000 boards.adultswim.com marketwatch.com
Table 1: A selection of popular sites that are homogeneous along various demographic dimensions.
visually apparent from Figure 5, there are significant differ-
College Female
70
Non−White !
Over 25
ences in how groups distribute their time on the web. These
! !
60
differences—which, as mentioned above, hold for highly fre-
r−Capita Pageviews
!
quented sites such as Facebook and YouTube—are in some
White No College
50
cases even more pronounced for lower traffic sites. For in-
40 Male stance, the gaming site pogo.com accounts for less than 1%
30
of pageviews among both low and high income users, but
Jake Hofman (hofman@yahoo-inc.com) Under 25
Learning from Web Activityusers spend almost twice as much of their time 15 / 42
low income JSM, 2011.08.01
21. Site-level skew
This skew persists even when we restrict attention to the top 10k
or 1k sites
1.0 1.0 1.0
Proportion College Educated Visitors
Proportion Female Site Visitors
Proportion White Site Visitors
0.8 0.8 0.8
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2
0.0 0.0 0.0
101 102 103 104 105 101 102 103 104 105 101 102 103 104 105
Site Rank Site Rank Site Rank
1.0 1.0
Household Incomes Under $50,000
Proportion of Adult Site Visitors
0.8 0.8
Proportion of Visitors With
0.6 0.6
0.4 0.4
0.2 0.2
0.0 0.0
101 102 103 104 105 101 102 103 104 105
Site Rank Site Rank
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 16 / 42
22. Sites vs. ZIPs
How do diversity of the online and offline worlds compare?
Sites Sites
ZIPs ZIPs
Density
Density
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Proportion Female Proportion White
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 17 / 42
23. Sites vs. ZIPs
How do diversity of the online and offline worlds compare?
Sites Sites
ZIPs ZIPs
Density
Density
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Proportion Female Proportion White
As expected, neighborhoods are more gender-balanced than sites
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 17 / 42
24. Sites vs. ZIPs
How do diversity of the online and offline worlds compare?
Sites Sites
ZIPs ZIPs
Density
Density
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Proportion Female Proportion White
But sites typically have more racially diverse audiences than
neighborhoods have residents
Jake Hofman (hofman@yahoo-inc.com) Learning from Web Activity JSM, 2011.08.01 17 / 42