Matthew S. Weber presented research at the 130th Annual Meeting of Big Data, Big Theory & The Thread of Recent History. The presentation discussed analyzing large-scale datasets to study how complete they are over time. It found datasets on political events and natural disasters became less complete as more webpages and URLs were added over multiple crawls. However, the rate of incompleteness followed exponential functions and could be corrected for using established factors for each dataset. While reliability challenges are not unique, understanding degradation rates can help researchers account for gaps in large internet-sourced datasets.
7. What’s in the data?
Source | Destination | Date | Frequency | Content Type | Bytes | Content
Link Data:
http://gawker.com/5953665/mitt-romneys-staff-played-the-media-covering-them-in-a-friendly-game-of-flag-football
Mitt Romney's Staff Played the Media Covering Them in a Friendly Game of Flag
http://gawker.com
2012-10-22
18.
Dataset | Research Potential | Dates | Captures | Unique URLs
Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003 – 2012 | 1,694,236 | 663,740
Superstorm Sandy | | 2003 – 2012 | 41,703,112 | 20,013,455
US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th – 112th Congresses | 26,965,770 | 8,674,397
US House | | | 51,840,777 | 12,410,014
Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010 – 2012 | 247,928,272 | 113,259,655
US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008 – 2012 | 1,315,132,555 | 539,184,823
23. Figure: Potential vs. Actual URLs (axis labels: t, Count of Pages, Count of URLs; series: Potential, Actual, Difference)
24. Figure: Changes in Crawl Completeness (axis labels: t, Count of Pages, Count of URLs; series: OWS, House, Senate, Katrina)
b = existing / potential
set a unit of time for analysis, c
choosing n periods across a total time T
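One reading of the ratio on this slide (existing over potential, measured once per unit of time) can be sketched as below. This is only an illustration under stated assumptions: the slides do not define "potential", so the sketch takes it to be the cumulative set of URLs seen in any crawl so far, and takes "existing" to be the URLs captured in the current period. The function name `completeness_by_period` and the sample URLs are hypothetical.

```python
def completeness_by_period(crawls):
    """Return existing/potential for each period.

    crawls: list of sets, one per period (oldest first), each holding the
    URLs captured in that period.
    Assumption: 'potential' is the cumulative set of URLs seen so far;
    the slides do not state how it was actually computed.
    """
    seen = set()      # every URL observed in any period up to now
    ratios = []
    for captured in crawls:
        seen |= captured
        # Guard against an empty first crawl to avoid dividing by zero.
        ratios.append(len(captured) / len(seen) if seen else 0.0)
    return ratios


# Hypothetical three-period example with placeholder URLs.
example_crawls = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a", "d"},
]
ratios = completeness_by_period(example_crawls)
print(ratios)
```

In this toy input the third crawl captures two of the four URLs known by then, so its ratio falls below the first two periods.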
25. How does this help?
Each of the illustrated cases fits against an exponential function ~ e^(bt), with b:
• Senate: 0.13
• House: 0.13
• Katrina: 0.02
• OWS: 0.10
In the ideal case, it would be possible to create a factor that corrects for data degrade: e^(bt)
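The exponential fit mentioned on this slide can be illustrated with a minimal sketch. It assumes the quantity of interest follows y_t ≈ a · e^(bt) and estimates b by ordinary least squares on log(y_t). The slides do not say how the reported rates were actually fitted, so this is a swapped-in standard technique, and the function names and the synthetic series are hypothetical.

```python
import math


def fit_exponential_rate(values):
    """Estimate (a, b) in y_t ~ a * exp(b * t) for t = 0, 1, 2, ...

    Uses least squares on log(y_t). This is an assumed method, not
    necessarily the procedure used in the presentation.
    """
    if len(values) < 2 or any(v <= 0 for v in values):
        raise ValueError("need at least two strictly positive values")
    n = len(values)
    ts = range(n)
    logs = [math.log(v) for v in values]
    t_mean = sum(ts) / n
    y_mean = sum(logs) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, logs))
    sxx = sum((t - t_mean) ** 2 for t in ts)
    b = sxy / sxx                         # slope of the log-linear fit
    a = math.exp(y_mean - b * t_mean)     # intercept, back on the original scale
    return a, b


def exponential_factor(b, t):
    """Return exp(b * t), the factor the slide writes as e^(bt)."""
    return math.exp(b * t)


# Synthetic series generated with b = 0.13 (the rate the slide lists for
# the Senate and House datasets); the counts themselves are made up.
synthetic = [1000.0 * math.exp(0.13 * t) for t in range(10)]
a_hat, b_hat = fit_exponential_rate(synthetic)
print(round(b_hat, 4), round(exponential_factor(b_hat, 5), 4))
```

Whether the factor should multiply or divide an observed count depends on whether the fitted series describes captured pages or missing pages, which the slide leaves open.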
Emperor Penguins… huddling together for survival… Population… Interacting in a large ecosystem with other animals.
WhiteHouse.gov press release from May 1, 2003, archived on May 6, 2003
WhiteHouse.gov press release from May 1, 2003, archived on October 1, 2003
July 14, 2006
February 25, 2011
Correlations between outgoing link vectors to show profile similarities
20th Century Collection = 9TB of metadata
Media Seed List = 4,891
For instance, researchers have proposed focusing archival efforts on capturing data that changes the most frequently, in order to capture the majority of new content [36]. Elsewhere, researchers have suggested that crawling strategies should prioritize archival efforts based on the size and relative position of websites within their larger ecosystems [37].
For instance, Driscoll and Walker (2014) compared Twitter data collected via a public API with data collected from a “fire hose” provided by GNIP PowerTrack and found significant differences between the two datasets. In most cases the PowerTrack data proved to be more powerful,