An Automated Snowball Census
      of the Political Web

             Abe Gong
        University of Michigan
             JITP 2011
Motivation

The blogosphere is one of the best sources of
political data in all history.


Understanding political bloggers can help us
understand political participation more broadly.


In order to compare “the average blogger” to
“the average citizen,” we need a representative
sample of bloggers.
Wanted: A sampling frame for
all political bloggers
Challenges: scale and sparseness


    No complete index of blogs exists,
    let alone political blogs
•   250 million web sites
•   40 new sites created every minute
•   Only 3 in 1,000 sites are political
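A quick back-of-the-envelope calculation, using only the figures on this slide, suggests why simple random sampling of the web is impractical. The target of 1,000 political sites is a hypothetical number chosen for illustration; it does not come from the talk.

```python
# Figures from the slide above.
TOTAL_SITES = 250_000_000
POLITICAL_PER_THOUSAND = 3  # "3 in 1,000 sites are political"

# Hypothetical target, chosen only for illustration.
TARGET_POLITICAL_SITES = 1_000

# Implied size of the political web (integer arithmetic avoids rounding).
expected_political = TOTAL_SITES * POLITICAL_PER_THOUSAND // 1_000

# Expected number of uniformly random sites that would have to be checked
# to find the target number of political ones (ceiling division).
numerator = TARGET_POLITICAL_SITES * 1_000
draws_needed = -(-numerator // POLITICAL_PER_THOUSAND)

print(expected_political)  # 750000
print(draws_needed)        # 333334
```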
Previous research

    Sample Types
•   Convenience

    Examples
•   Johnson and Kaye, 2004
•   Leskovec, Backstrom, and Kleinberg, 2009

    Big Data, but no attempt at representativeness
Previous research

    Sample Types
•   Convenience
•   Prominence

    Examples
•   McKenna and Pole, 2008
•   Wallsten, 2008

    Good data, but only includes popular sites.
Previous research

    Sample Types
•   Convenience
•   Prominence
•   Snowball

    Examples
•   Hindman, Tsioutsiouliklis, and Johnson, 2003
•   Karpf, 2008

    Sample properties unclear
Previous research

    Sample Types
•   Convenience
•   Prominence
•   Snowball
•   Over-sample

    Examples
•   Lenhart and Fox, 2006
•   Schlozman, Verba, and Brady, 2010
•   Lawrence, Sides, and Farrell, 2010
•   Karen's US-IMPACT study

    Representative sample, but linking to Big Data is hard
Methodology

1. Start from a seed batch of political sites.
2. Download and classify each site in the
batch.
3. For political sites, harvest outbound
hyperlinks and add unvisited links to the
next batch.
4. Repeat from step 2 until no new links are
found.
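The four steps above can be sketched as a batch-wise crawl loop. This is a minimal illustration, not SnowCrawl's actual code: the function names, the in-memory `TOY_WEB` link graph, and the `POLITICAL` lookup that stands in for downloading and classifying a site are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set


def snowball_census(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],
    is_political: Callable[[str], bool],
) -> Set[str]:
    """Batch-wise snowball crawl; returns the sites classified as political."""
    visited: Set[str] = set()
    political: Set[str] = set()
    batch: Set[str] = set(seeds)            # step 1: seed batch

    while batch:                            # step 4: stop when no new links
        visited |= batch                    # mark the whole batch as seen
        next_batch: Set[str] = set()
        for site in sorted(batch):
            if not is_political(site):      # step 2: download and classify
                continue
            political.add(site)
            for link in fetch_links(site):  # step 3: harvest outbound links
                if link not in visited:
                    next_batch.add(link)
        batch = next_batch
    return political


# Hypothetical toy web: site -> outbound links.
TOY_WEB: Dict[str, List[str]] = {
    "a.example": ["b.example", "shop.example"],
    "b.example": ["a.example", "c.example"],
    "c.example": [],
    "shop.example": ["d.example"],
    "d.example": [],
}
# Stand-in for the text classifier.
POLITICAL = {"a.example", "b.example", "c.example", "d.example"}

result = snowball_census(
    seeds=["a.example"],
    fetch_links=lambda site: TOY_WEB.get(site, []),
    is_political=lambda site: site in POLITICAL,
)
print(sorted(result))  # ['a.example', 'b.example', 'c.example']
```

In this toy graph, `d.example` is political but is linked only from a non-political site, so the crawl never reaches it. Links are harvested from political sites only, which keeps the crawl bounded but means such sites are missed.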
Toy Example
Bag-of-words logit regression

Prob(political) ≈ logit⁻¹(α + βX)
  X = Vector of word counts
  α = Bias term
  β = Word weights


1. Hand-code a training sample (n=2,000)
2. Calibrate the computer
3. Hand-code a testing sample (n=200)
4. Evaluate the classifier
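A bag-of-words logistic regression of this kind can be sketched in a few lines of plain Python. The slide does not say how the classifier was fitted, so the batch gradient ascent below is an assumption, and the four training texts are invented toy data rather than the hand-coded sample. The probability is obtained by passing α + βX through the logistic (inverse-logit) function.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def word_counts(text: str) -> Counter:
    """X: bag-of-words vector, stored as a word -> count mapping."""
    return Counter(text.lower().split())


def prob_political(alpha: float, beta: Dict[str, float], x: Counter) -> float:
    """Logistic function applied to alpha + beta . X."""
    z = alpha + sum(beta.get(word, 0.0) * count for word, count in x.items())
    return 1.0 / (1.0 + math.exp(-z))


def fit(docs: List[Tuple[str, int]], steps: int = 200, lr: float = 0.1):
    """Fit alpha (bias) and beta (word weights) by batch gradient ascent."""
    alpha, beta = 0.0, {}
    data = [(word_counts(text), label) for text, label in docs]
    for _ in range(steps):
        # Errors are computed for the whole batch before any update.
        errors = [y - prob_political(alpha, beta, x) for x, y in data]
        alpha += lr * sum(errors)
        for (x, _), err in zip(data, errors):
            for word, count in x.items():
                beta[word] = beta.get(word, 0.0) + lr * err * count
    return alpha, beta


# Invented toy training sample: 1 = political, 0 = not political.
TOY_TRAINING = [
    ("election vote", 1),
    ("senate campaign", 1),
    ("cake recipe", 0),
    ("football match", 0),
]

alpha, beta = fit(TOY_TRAINING)
p_pol = prob_political(alpha, beta, word_counts("election campaign"))
p_non = prob_political(alpha, beta, word_counts("cake match"))
print(p_pol > 0.5, p_non < 0.5)  # True True
```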
Text Classifier Word Cloud
Classifier reliability



    Human-human:         80.9%
    Human-computer: 81.0%


    Krippendorff's Alpha: .733
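Both kinds of reliability figure can be computed from two parallel lists of codes. The sketch below covers only the simplest case of Krippendorff's alpha (two coders, nominal labels, no missing values). The slide does not say which pair of coders the reported alpha refers to, and the labels here are invented for illustration.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of units on which the two coders assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def krippendorff_alpha_nominal(a: Sequence[int], b: Sequence[int]) -> float:
    """Alpha for two coders, nominal labels, no missing values.

    Assumes at least two distinct labels occur, otherwise the
    expected disagreement is zero and alpha is undefined.
    """
    n_values = 2 * len(a)  # every unit carries two codes
    # Each disagreeing unit contributes two ordered mismatched pairs.
    observed = 2 * sum(x != y for x, y in zip(a, b)) / n_values
    counts = Counter(a) + Counter(b)
    expected = sum(
        counts[c] * counts[k] for c in counts for k in counts if c != k
    ) / (n_values * (n_values - 1))
    return 1.0 - observed / expected


# Invented codes for four sites: 1 = political, 0 = not political.
coder_a = [1, 1, 0, 0]
coder_b = [1, 0, 0, 0]

agreement = percent_agreement(coder_a, coder_b)
alpha_value = krippendorff_alpha_nominal(coder_a, coder_b)
print(agreement)              # 0.75
print(round(alpha_value, 4))  # 0.5333
```

Alpha is lower than raw agreement because it discounts the agreement two coders would reach by chance given how often each label is used.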
Census Results

Implemented in Python: SnowCrawl
 Executes in less than 24 hours
 1.8 million sites crawled
 800,000 political
 42% blogs


                     http://code.google.com/p/snowcrawl
Comparison by strata

                   Top 500   Top 5,000   Census
Organization
Owned by orgs      66.1***   53.1        44.4
Multiple authors   75.2*     66.7        62.2
M-updates/day      43.4***   19.4***      6.1

Design
Advertising        67.3**    57.1        51.2
Blogroll           57.5*     66.3***     45.1
Video              48.7***   35.7***     18.3
Comparison by strata


                             Top 500   Top 5,000   Census
Polls and public opinion     70.8***   65.3*       52.4
Elections and campaigns      50.4      45.9        51.2
Legislation and law-making   43.4      41.8        43.9
Implementation of policy     38.1      39.8        30.5
Decisions by courts          34.5***   24.5        17.1
Political figures            46.0***   39.8**      24.4
Political parties            38.9***   32.7*       20.7
Philosophical discussion     26.5      29.6        25.6
State and local government   36.3*     38.8**      24.4
Foreign policy               42.5***   38.8***     15.9
International relations      31.9**    33.7**      18.3
Where next?

●   Survey of bloggers
●   Poststratification weighting
●   Network analysis
●   Content analysis of blogs
●   Blog post panel
●   Sentiment analysis/Survey imputation
●   Re-implement in Hadoop
Where next ...?


    [Figure: ANES, GSS, and Roxy...?, shown with question marks]
Conclusions

1. Combinations of tools are
   much more powerful than
   individual tools – share ideas
   across disciplines.



2. Sampling matters! With a little
   extra effort, we can sample
   populations on the web.



3. Complementary, horizontal,
   and offline data is key for the
   compSocSci research agenda.
Thank you!



             Questions? Comments?



                        Abe Gong
     Public policy, political science, complex systems
                  University of Michigan
                   agong@umich.edu
                 lowlywonk.blogspot.com
            www-personal.umich.edu/~agong