Internet Archives as a Tool for Research: Decay in Large Scale Archival Records
1. Matthew S. Weber
Hai Nguyen
Rutgers University
IEEE Big Data Congress 2015
Millenium Hotel, NY, NY
Wednesday, July 1, 2015
BIG DATA,
BIG ISSUES
2.
3. Dataset | Research Potential | Dates | Captures | Unique URLs
Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003 – 2012 | 1,694,236 | 663,740
Superstorm Sandy | | 2003 – 2012 | 41,703,112 | 20,013,455
US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th – 112th Congresses | 26,965,770 | 8,674,397
US House | | | 51,840,777 | 12,410,014
Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010 – 2012 | 247,928,272 | 113,259,655
US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008 – 2012 | 1,315,132,555 | 539,184,823
4. What’s in the data?
Source | Destination | Date | Frequency | Content Type | Bytes | Descriptive Text
Link Data (example record):
Source: http://gawker.com/5953665/mitt-romneys-staff-played-the-media-covering-them-in-a-friendly-game-of-flag-football
Destination: http://gawker.com
Date: 2012-10-22
Descriptive Text: Mitt Romney's Staff Played the Media Covering Them in a Friendly Game of Flag
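A record like the one above can be sketched as a simple typed structure. This is a hypothetical representation, not the archive's actual schema; the field names follow the header row on the slide, and the `frequency`, `content_type`, and `bytes` values are placeholders since the slide does not show them.

```python
from dataclasses import dataclass

# Hypothetical record type mirroring the fields listed on the slide.
@dataclass
class LinkRecord:
    source: str            # URL of the page containing the link
    destination: str       # URL the link points to
    date: str              # crawl date, YYYY-MM-DD
    frequency: int         # placeholder: not shown on the slide
    content_type: str      # placeholder: MIME type of the source page
    num_bytes: int         # placeholder: size of the source page
    descriptive_text: str  # anchor text of the link

record = LinkRecord(
    source=("http://gawker.com/5953665/mitt-romneys-staff-played-the-media-"
            "covering-them-in-a-friendly-game-of-flag-football"),
    destination="http://gawker.com",
    date="2012-10-22",
    frequency=1,               # placeholder value
    content_type="text/html",  # placeholder value
    num_bytes=0,               # placeholder value
    descriptive_text=("Mitt Romney's Staff Played the Media Covering "
                      "Them in a Friendly Game of Flag"),
)
print(record.destination, record.date)
```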
20. • Scale out across multiple datasets:
– US House – 2005:2013
– US Senate – 2005:2013
– Hurricane Katrina – 2003:2012
– Occupy Wall Street – 2010:2012
21. [Figure: Potential vs. Actual URLs; Count of URLs plotted over time t, with series Potential, Actual, and Difference]
22. [Figure: Changes in Crawl Completeness; Count of URLs over time t for the OWS, House, Senate, and Katrina datasets, showing existing vs. potential URLs]
To estimate b, set a unit of time for analysis, c, choosing n periods across a total time T.
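The period-setting step above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the timestamps are synthetic (days since the start of a collection window), and the real analysis would bin crawl dates.

```python
import numpy as np

# Sketch of the time-binning step: choose n periods across a total time T,
# which fixes the unit of analysis c = T / n, then count captures per period.
def bin_captures(timestamps, T, n):
    c = T / n                          # width of one analysis period
    edges = np.linspace(0, T, n + 1)   # n + 1 bin edges covering [0, T]
    counts, _ = np.histogram(timestamps, bins=edges)
    return c, counts

rng = np.random.default_rng(0)
timestamps = rng.uniform(0, 365, size=1000)  # one year of synthetic captures
c, counts = bin_captures(timestamps, T=365, n=12)
print(round(c, 2), counts.sum())
```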
23. In the ideal case, it would be possible to create a factor that corrects for data degradation: e^(bt)
How does this help? Each of the illustrated cases fits against an exponential function ~ e^(bt):
• Senate: b = 0.13
• House: b = 0.13
• Katrina: b = 0.02
• OWS: b = 0.10
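A minimal sketch of how such a rate could be fit, assuming the N(t) ~ N0 · e^(bt) form above: take logs and run an ordinary least-squares fit. The data here are synthetic, generated with a known b; the slide's fitted values (Senate 0.13, House 0.13, Katrina 0.02, OWS 0.10) came from the real crawl series.

```python
import numpy as np

# Fit b in N(t) ~ N0 * e^(b*t) via log-linear least squares:
# log N = log N0 + b*t, so the slope of the fitted line estimates b.
def fit_exponential_rate(t, counts):
    slope, intercept = np.polyfit(t, np.log(counts), 1)
    return slope  # estimate of b

t = np.arange(24)                    # e.g. 24 analysis periods
true_b = 0.13
counts = 1e6 * np.exp(true_b * t)    # noise-free synthetic series
b_hat = fit_exponential_rate(t, counts)
print(round(b_hat, 2))  # → 0.13
```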
27. Lessons Learned
• Degradation is a factor in working with available large-scale data
– In part, degradation is related to the provenance of the data
– In turn, there is a need to record the origins of datasets (provenance)
• Patterns of degradation prove problematic for statistical analyses
– Ex: network analysis with snowball samples vs. whole network
• Continued work needed to develop research guidelines as more
scholars engage with this data
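The snowball-vs-whole-network concern can be illustrated with a toy simulation: when archive decay removes pages (nodes), link statistics computed on the surviving subgraph diverge from the whole-network values. The graph and loss rate below are synthetic, and random node loss is a simplification of actual decay patterns; this is a sketch of the bias, not the authors' analysis.

```python
import random

# Mean degree of the subgraph induced by `nodes` over an edge list.
def mean_degree(nodes, edges):
    deg = {v: 0 for v in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            deg[a] += 1
            deg[b] += 1
    return sum(deg.values()) / len(nodes)

random.seed(42)
nodes = set(range(100))
edges = [(a, b) for a in sorted(nodes) for b in sorted(nodes)
         if a < b and random.random() < 0.1]  # ~10% of possible ties

full = mean_degree(nodes, edges)
# Simulate 30% archival loss: drop nodes, keep only surviving edges.
surviving = set(random.sample(sorted(nodes), 70))
degraded = mean_degree(surviving, edges)
print(round(full, 2), round(degraded, 2))  # degraded subgraph looks sparser
```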
28. Get in contact with us:
– matthew.weber@rutgers.edu
– @mediareinvented
The Team
– Kris Carpenter, Vinay Goel, Internet Archive
– David Lazer, Katherine Ognyanova, Northeastern University
– Allie Kosterich, Hai Nguyen, Rutgers University
Research supported by NSF Award #1244727 and the NetSCI Lab @ Rutgers
Editor's Notes
There are many types of large-scale data… only talking about Internet based data… focusing on datasets that are re-used.
- Markus - “social scientists are used to fine-grain, well-controlled data, and that doesn’t exist on the web”
20th Century Collection = 9TB of metadata
Media Seed List = 4,891
For instance, researchers have proposed focusing archival efforts on capturing data that changes the most frequently, in order to capture the majority of new content [36]. Elsewhere, researchers have suggested that crawling strategies should prioritize archival efforts based on the size and relative position of websites within their larger ecosystems [37].
150 TB storage… main compute pool has 72 compute nodes w/ 128GB memory per node
Correlations between outgoing link vectors to show profile similarities
Driscoll and Walker (2014): for instance, a comparison of Twitter data collected via a public API with data collected from a "fire hose" provided by GNIP PowerTrack found significant differences between the two datasets. In most cases the PowerTrack data proved to be more powerful.
3 month windows of time…
We also looked at the size of the webpages, estimating total size, but that approach wasn't as reliable.