Internet Archives as a Tool for Research: Decay in Large Scale Archival Records
1. Matthew S. Weber
Hai Nguyen
Rutgers University
IEEE Big Data Congress 2015
Millenium Hotel, NY, NY
Wednesday, July 1, 2015
BIG DATA,
BIG ISSUES
2.
3. Dataset | Research Potential | Dates | Captures | Unique URLs
Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003 – 2012 | 1,694,236 | 663,740
Superstorm Sandy | | 2003 – 2012 | 41,703,112 | 20,013,455
US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th – 112th Congresses | 26,965,770 | 8,674,397
US House | | | 51,840,777 | 12,410,014
Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010 – 2012 | 247,928,272 | 113,259,655
US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008 – 2012 | 1,315,132,555 | 539,184,823
4. What’s in the data?
Source | Destination | Date | Frequency | Content Type | Bytes | Descriptive Text
Link Data (example record):
Source: http://gawker.com/5953665/mitt-romneys-staff-played-the-media-covering-them-in-a-friendly-game-of-flag-football
Destination: http://gawker.com
Date: 2012-10-22
Descriptive Text: Mitt Romney's Staff Played the Media Covering Them in a Friendly Game of Flag
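A record like the one above can be sketched as a simple typed structure. This is a hypothetical representation, not the archive's actual schema; the field names follow the header row on the slide, and the `frequency`, `content_type`, and `bytes` values are placeholders since the slide does not show them.

```python
from dataclasses import dataclass

# Hypothetical record type mirroring the fields listed on the slide.
@dataclass
class LinkRecord:
    source: str            # URL of the page containing the link
    destination: str       # URL the link points to
    date: str              # crawl date, YYYY-MM-DD
    frequency: int         # placeholder: not shown on the slide
    content_type: str      # placeholder: MIME type of the source page
    num_bytes: int         # placeholder: size of the source page
    descriptive_text: str  # anchor text of the link

record = LinkRecord(
    source=("http://gawker.com/5953665/mitt-romneys-staff-played-the-media-"
            "covering-them-in-a-friendly-game-of-flag-football"),
    destination="http://gawker.com",
    date="2012-10-22",
    frequency=1,               # placeholder value
    content_type="text/html",  # placeholder value
    num_bytes=0,               # placeholder value
    descriptive_text=("Mitt Romney's Staff Played the Media Covering "
                      "Them in a Friendly Game of Flag"),
)
print(record.destination, record.date)
```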
20. • Scale out across multiple datasets:
– US House – 2005:2013
– US Senate – 2005:2013
– Hurricane Katrina – 2003:2012
– Occupy Wall Street – 2010:2012
21. [Figure: Potential vs. Actual URLs; Count of URLs plotted over time t, with series Potential, Actual, and Difference]
22. [Figure: Changes in Crawl Completeness; Count of URLs over time t for the OWS, House, Senate, and Katrina datasets, showing existing vs. potential URLs]
To estimate b, set a unit of time for analysis, c, choosing n periods across a total time T.
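The period-setting step above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the timestamps are synthetic (days since the start of a collection window), and the real analysis would bin crawl dates.

```python
import numpy as np

# Sketch of the time-binning step: choose n periods across a total time T,
# which fixes the unit of analysis c = T / n, then count captures per period.
def bin_captures(timestamps, T, n):
    c = T / n                          # width of one analysis period
    edges = np.linspace(0, T, n + 1)   # n + 1 bin edges covering [0, T]
    counts, _ = np.histogram(timestamps, bins=edges)
    return c, counts

rng = np.random.default_rng(0)
timestamps = rng.uniform(0, 365, size=1000)  # one year of synthetic captures
c, counts = bin_captures(timestamps, T=365, n=12)
print(round(c, 2), counts.sum())
```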
23. In the ideal case, it would be possible to create a factor that corrects for data degradation: e^(bt)
How does this help? Each of the illustrated cases fits against an exponential function ~ e^(bt):
• Senate: b = 0.13
• House: b = 0.13
• Katrina: b = 0.02
• OWS: b = 0.10
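A minimal sketch of how such a rate could be fit, assuming the N(t) ~ N0 · e^(bt) form above: take logs and run an ordinary least-squares fit. The data here are synthetic, generated with a known b; the slide's fitted values (Senate 0.13, House 0.13, Katrina 0.02, OWS 0.10) came from the real crawl series.

```python
import numpy as np

# Fit b in N(t) ~ N0 * e^(b*t) via log-linear least squares:
# log N = log N0 + b*t, so the slope of the fitted line estimates b.
def fit_exponential_rate(t, counts):
    slope, intercept = np.polyfit(t, np.log(counts), 1)
    return slope  # estimate of b

t = np.arange(24)                    # e.g. 24 analysis periods
true_b = 0.13
counts = 1e6 * np.exp(true_b * t)    # noise-free synthetic series
b_hat = fit_exponential_rate(t, counts)
print(round(b_hat, 2))  # → 0.13
```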
27. Lessons Learned
• Degradation is a factor in working with available large-scale data
– In part, degradation is related to the provenance of the data
– In turn, there is a need to record the origins of datasets (provenance)
• Patterns of degradation prove problematic for statistical analyses
– Ex: network analysis with snowball samples vs. whole network
• Continued work needed to develop research guidelines as more
scholars engage with this data
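The snowball-vs-whole-network concern can be illustrated with a toy simulation: when archive decay removes pages (nodes), link statistics computed on the surviving subgraph diverge from the whole-network values. The graph and loss rate below are synthetic, and random node loss is a simplification of actual decay patterns; this is a sketch of the bias, not the authors' analysis.

```python
import random

# Mean degree of the subgraph induced by `nodes` over an edge list.
def mean_degree(nodes, edges):
    deg = {v: 0 for v in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            deg[a] += 1
            deg[b] += 1
    return sum(deg.values()) / len(nodes)

random.seed(42)
nodes = set(range(100))
edges = [(a, b) for a in sorted(nodes) for b in sorted(nodes)
         if a < b and random.random() < 0.1]  # ~10% of possible ties

full = mean_degree(nodes, edges)
# Simulate 30% archival loss: drop nodes, keep only surviving edges.
surviving = set(random.sample(sorted(nodes), 70))
degraded = mean_degree(surviving, edges)
print(round(full, 2), round(degraded, 2))  # degraded subgraph looks sparser
```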
28. Get in contact with us:
– matthew.weber@rutgers.edu
– @mediareinvented
The Team
– Kris Carpenter, Vinay Goel, Internet Archive
– David Lazer, Katherine Ognyanova, Northeastern University
– Allie Kosterich, Hai Nguyen, Rutgers University
Research supported by NSF Award #1244727 and the NetSCI Lab @ Rutgers
Editor's Notes
There are many types of large-scale data… only talking about Internet based data… focusing on datasets that are re-used.
- Markus - “social scientists are used to fine-grain, well-controlled data, and that doesn’t exist on the web”
20th Century Collection = 9TB of metadata
Media Seed List = 4,891
For instance, researchers have proposed focusing archival efforts on capturing data that changes the most frequently, in order to capture the majority of new content [36]. Elsewhere, researchers have suggested that crawling strategies should prioritize archival efforts based on the size and relative position of websites within their larger ecosystems [37].
150 TB storage… main compute pool has 72 compute nodes w/ 128GB memory per node
Correlations between outgoing link vectors to show profile similarities
Driscoll and Walker (2014): for instance, a comparison of Twitter data collected via a public API with data collected from a "fire hose" provided by GNIP PowerTrack found significant differences between the two datasets. In most cases the PowerTrack data proved to be more powerful.
3 month windows of time…
We also looked at the size of the webpages, estimating total size, but that approach wasn't as reliable.