
Filling in the Blanks: Capturing Dynamically Generated Content

JCDL 2012 Doctoral Consortium presentation by Justin F. Brunelle. Covers the problem Web 2.0 creates for preservation, and proposes a solution for client-side capture of content.


Slide transcript

  1. Filling in the Blanks: Capturing Dynamically Generated Content. Justin F. Brunelle, Old Dominion University. Advisor: Dr. Michael L. Nelson. JCDL ‘12 Doctoral Consortium, 06/10/2012.
  2. (image-only slide)
  3. (image-only slide)
  4. Problem!
     • Which exists in the archive? Probably the unauthenticated version, right?
     • What factors created “my” representation? Can I archive “my” representation?
     • Am I seeing undead resources? A mix of live and archived content?
     • How can we capture, share, and archive user experiences?
  5. Which version is in the Internet Archive?
  6. Which version is in WebCite?
  7. Craigslist.org
     $ curl -I -L http://www.craigslist.org
     HTTP/1.1 302 Found
     Set-Cookie: …
     Location: http://geo.craigslist.org/
     HTTP/1.1 302 Found
     Content-Type: text/html; charset=iso-8859-1
     Connection: close
     Location: http://norfolk.craigslist.org
     Date: Thu, 31 May 2012 23:26:27 GMT
     Set-Cookie: …
     Server: Apache
     HTTP/1.1 200 OK
     Connection: close
     Cache-Control: max-age=3600, public
     Last-Modified: Thu, 31 May 2012 23:13:46 GMT
     Set-Cookie: …
     Transfer-Encoding: chunked
     Date: Thu, 31 May 2012 23:13:46 GMT
     Vary: Accept-Encoding
     Content-Type: text/html; charset=iso-8859-1;
     X-Frame-Options: Allow-From https://forums.craigslist.org
     Server: Apache
     Expires: Fri, 01 Jun 2012 00:13:46 GMT
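The GeoIP-driven redirect chain on the slide above can be inspected mechanically. The following is an illustrative sketch (not part of the talk): a small parser that turns a `curl -I -L` style header dump into a list of (status, Location) hops, making it explicit that two clients in different locations can end up at different final resources.

```python
def parse_redirect_chain(raw):
    """Parse a `curl -I -L` style header dump into (status, location) hops.

    Illustrative helper: each HTTP response in the dump starts with an
    HTTP/ status line; a Location header, if present, names the next hop.
    """
    hops = []
    status, location = None, None
    for line in raw.splitlines():
        if line.startswith("HTTP/"):
            if status is not None:
                hops.append((status, location))
            status, location = int(line.split()[1]), None
        elif line.lower().startswith("location:"):
            location = line.split(":", 1)[1].strip()
    if status is not None:
        hops.append((status, location))
    return hops

trace = """HTTP/1.1 302 Found
Location: http://geo.craigslist.org/
HTTP/1.1 302 Found
Location: http://norfolk.craigslist.org
HTTP/1.1 200 OK"""

print(parse_redirect_chain(trace))
# [(302, 'http://geo.craigslist.org/'), (302, 'http://norfolk.craigslist.org'), (200, None)]
```

Run against the slide's trace, the chain shows the final 200 response depends on where the request originated, which is exactly the ambiguity an archive must resolve.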
  8. Live resource, accessed from Norfolk
  9. Archived resource, submitted to WebCite from Norfolk
  10. Live Norfolk Interactive Mapper: http://gisapp2.norfolk.gov/interactive_mapper/viewer.htm
  11. Archived Norfolk Interactive Mapper: http://web.archive.org/web/20100924020604/http://gisapp2.norfolk.gov/interactive_mapper/viewer.htm
  12. Web 2.0
     • Crawlers aren’t enough
     • Relies on interaction/personalization
     • Users may want to archive personal content
     • How do we capture user experiences? Justin’s vs. Dr. Nelson’s experience? Both?
     • What about sharing browsing sessions?
  13. Things are better (but really worse)
     • Better UI, worse archiving
     • HTML5
     • JavaScript (document.write)
     • Cookies
     • User interaction
     • GeoIP
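The document.write hazard named on slide 13 is easy to demonstrate. Below is an illustrative sketch (not from the talk, and deliberately simplified): a crawler that only parses static markup never sees a link that JavaScript writes into the page at render time, so the resource it points to is never captured.

```python
import re

# A page whose second link is only created at render time by JavaScript.
# A client executing the script sees both links; a markup-only crawler
# (as classic archive crawlers are) sees only the first.
page = """
<html><body>
  <a href="/static.html">always in the markup</a>
  <script>
    document.write('<a href="/dynamic.html">written at render time</a>');
  </script>
</body></html>
"""

def naive_crawler_links(html):
    """Extract hrefs from markup, skipping script bodies (the crawler's view)."""
    without_scripts = re.sub(r"<script.*?</script>", "", html, flags=re.S)
    return re.findall(r'href="([^"]+)"', without_scripts)

print(naive_crawler_links(page))
# ['/static.html']  (/dynamic.html is missed)
```

Cookies, user interaction, and GeoIP create the same gap by a different route: the markup is the same, but the representation delivered to a given client differs.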
  14. Traditional representation generation (diagram: a URI identifies a Resource; dereferencing the URI yields a Representation, which represents the Resource). From the W3C Web Architecture.
  15. Representation through content negotiation (diagram: as above, with a Negotiate step between dereference and representation). From the W3C Web Architecture.
  16. Web 2.0 representation generation (diagram: dereferencing a URI, user interaction, and client-side script together produce the Representation of the Resource).
  17. Prior Work
     • Capture for debugging and security: Mickens, 2010; Livshits, 2007, 2009, 2010; Dhawan, 2009
     • Crawlers: Mesbah, 2008; Duda, 2008; Lowet, 2009
     • Caching dynamic content: Benson, 2010; Karri, 2009; Boulos, 2010; Periyapatna, 2009; Sivasubramanian, 2007
     • Walden’s Paths: http://www.csdl.tamu.edu/walden/
     • IIPC Workshop 2012, Archiving the Future Web: http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html
  18. Two Current Solutions
     • Browser-based crawling (IA, to be released; Heritrix 3.X)
       – Problematic at scale, misses post-render content, no session spanning, misses “personal” browsing
     • Transactional web archiving (LANL, http://theresourcedepot.org/)
       – Impact/depth is unknown, client-side changes are missed, must have server/content-author buy-in
  19. What can Justin do about it?
     • How can we capture THE user experience?
       – How much user-shared content is archivable?
       – What defines a dynamic representation? Infinitely changing?
       – How much dynamic content are archives missing?
       – What tools are required to capture the representation? A browser add-on?
       – How much will users contribute to the archives?
     • Is this even possible?
  20. Characteristics of a Potential Solution
     • Browser add-on
     • Crowd-sourced: user contributions to archives
     • Opt-in representation archiving/sharing
     • Capture the client-side DOM: JS, HTML, representation, etc.
     • Capture client-side events and the resulting DOM, including Ajax and post-render data
     • Package and submit to archives
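The capture-and-package idea on slide 20 can be sketched as a data structure. This is a minimal illustration under assumed names (SessionArchive, capture, package are hypothetical, not from the talk): record the post-event DOM after each user or Ajax event, then serialize the whole session for submission to an archive.

```python
import json
import time

class SessionArchive:
    """Minimal sketch of the slide's idea: record the client-side DOM
    after each user/Ajax event, then package the session for an archive.
    All names here are illustrative, not the author's implementation."""

    def __init__(self, uri):
        self.record = {"uri": uri, "states": []}

    def capture(self, event, dom_snapshot):
        # One entry per event: what happened, and the DOM it produced.
        self.record["states"].append({
            "event": event,          # e.g. "double click", "ajax response"
            "dom": dom_snapshot,     # serialized post-event DOM
            "at": time.time(),
        })

    def package(self):
        # In a real add-on this bundle would be submitted to an archive.
        return json.dumps(self.record)

arch = SessionArchive("http://maps.example.com/")
arch.capture("page load", "<div id='map'>tiles v1</div>")
arch.capture("double click", "<div id='map'>tiles v2 (zoomed)</div>")
bundle = json.loads(arch.package())
print(len(bundle["states"]))   # 2
```

The key design point the slide implies is that the unit of archiving becomes the event-plus-DOM pair rather than a single crawled HTML file, which is what lets Ajax and post-render changes survive.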
  21. (image-only slide)
  22. Dissertation Plan
     BEGIN
     • Background research, coursework, quals
     • Prevalence of unarchivable resources (current state)
     • Define test datasets (a set of dynamic and static test pages)
     • Define factors/equations of dynamic representations: what dynamic content can (and cannot) be captured for archiving?
     • Construct the software solution, a VCR for the Web: Record, Rewind, Replay
     • Analyze improved capture, client-side archiving: client-side (human-assisted) capture vs. traditional crawlers vs. headless clients
     • Explore how personalized archives can be combined with public web archives
     PhD Defense
  23. Current Work: How much can we archive?
     • Sample of Bit.ly URIs from Twitter
     • Load each page in four environments:
       – Live
       – 3rd-party archived: submit to and load from WebCitation
       – Locally stored: wget -k -p, then load from the local drive
       – Local only: load from the local drive with no Internet access
  24. Live: http://dctheatrescene.com/2009/06/11/toxic-avengers-john-rando/
  25. Archived (WebCite): http://www.webcitation.org/685EYfYEK
  26. Locally stored: http://localhost/dctheatrescene.com/2009/06/11/toxic-avengers-john-rando/
  27. Local only (no Internet): http://localhost/dctheatrescene.com/2009/06/11/toxic-avengers-john-rando/
     • Missing: 12/78 resources without Internet access
     • dctheatrescene.com/…/uperfish.args.js?e83a2c
     • dctheatrescene.com/…/css/datatables.css?ver=1.9.3
     • Small files, big impact
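The 12-of-78 result on the slide above implies a simple completeness measure. As an illustrative sketch (the function and resource names are hypothetical): compare the set of embedded resources the page requests live against the set that actually loads in a given environment, and report what is missed.

```python
def missing_fraction(live_resources, offline_resources):
    """Count resources the page requests live that fail to load offline."""
    missed = set(live_resources) - set(offline_resources)
    return len(missed), len(set(live_resources))

# Synthetic stand-ins mirroring the slide's numbers: 78 resources are
# requested live, and 12 of them do not load without Internet access.
live = {f"res{i}" for i in range(78)}
offline = live - {f"res{i}" for i in range(12)}

missed, total = missing_fraction(live, offline)
print(f"{missed}/{total} missing")   # 12/78 missing
```

The same comparison applies to each of the four test environments (live, WebCite, locally stored, local only), giving a per-environment archivability score for a page.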
  28. Thought Experiment
  29. Double-click 4x
  30. Click and drag to the left
  31. Submit to archive
  32. Future Research Questions
     • What dynamism can (and cannot) be captured for archiving?
     • Client-side archiving: client-side capture vs. traditional crawlers
     • Client-side contributions to web archives: archiving user experiences
  33. Conclusion
     • Is dynamic content archivable?
     • How much are we missing?
     • Can you archive your experience?
       – For the betterment of archives
       – For personal capture
  34. Backups
  35. References
     • J. Mickens, J. Elson, and J. Howell. Mugshot: deterministic capture and replay for JavaScript applications. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, NSDI ’10, pages 11-11, Berkeley, CA, USA, 2010. USENIX Association.
     • K. Vikram, A. Prateek, and B. Livshits. Ripley: automatically securing Web 2.0 applications through replicated execution. In Proceedings of the Conference on Computer and Communications Security, November 2009.
     • E. Kiciman and B. Livshits. AjaxScope: a platform for remotely monitoring the client-side behavior of Web 2.0 applications. In the 21st ACM Symposium on Operating Systems Principles, SOSP ’07, 2007.
     • B. Livshits and S. Guarnieri. Gulfstream: incremental static analysis for streaming JavaScript applications. Technical Report MSR-TR-2010-4, Microsoft, January 2010.
     • M. Dhawan and V. Ganapathy. Analyzing information flow in JavaScript-based browser extensions. Annual Computer Security Applications Conference, pages 382-391, 2009.
     • A. Mesbah, E. Bozdag, and A. van Deursen. Crawling Ajax by inferring user interface state changes. In Web Engineering, 2008, ICWE ’08, Eighth International Conference on, pages 122-134, July 2008.
     • C. Duda, G. Frey, D. Kossmann, and C. Zhou. AjaxSearch: crawling, indexing and searching Web 2.0 applications. Proc. VLDB Endow., 1:1440-1443, August 2008.
     • D. Lowet and D. Goergen. Co-browsing dynamic web pages. In WWW, pages 941-950, 2009.
  36. References
     • S. Chakrabarti, S. Srivastava, M. Subramanyam, and M. Tiwari. Memex: a browsing assistant for collaborative archiving and mining of surf trails. In Proceedings of the 26th VLDB Conference, 2000.
     • R. Karri. Client-side page element web-caching, 2009.
     • E. Benson, A. Marcus, D. R. Karger, and S. Madden. Sync Kit: a persistent client-side database caching toolkit for data intensive websites. In WWW, pages 121-130, 2010.
     • M. N. K. Boulos, J. Gong, P. Yue, and J. Y. Warren. Web GIS in practice VIII: HTML5 and the canvas element for interactive online mapping. International Journal of Health Geographics, March 2010.
     • S. Periyapatna. Total Recall for Ajax applications Firefox extension, 2009.
     • S. Sivasubramanian, G. Pierre, M. van Steen, and G. Alonso. Analysis of caching and replication strategies for web applications. IEEE Internet Computing, 11:60-66, 2007.
  37. Web Archives
     “Web archiving is the process of collecting portions of the World Wide Web and ensuring the collection is preserved … for future researchers, historians, and the public.” -- http://en.wikipedia.org/wiki/Web_archiving
  38. What does this have to do with DLs?
     • Improved coverage
     • NARA regulation
     • Improved “memory”
     • Gathers missing user experiences, or at least an adequate sub-sample
  39. Envisioned Solution (diagram: a timeline pairing user events such as text entered, double click, and button push with Ajax XMLResponse events; a prompt reads “SELECT PREVIOUS REPRESENTATION TO ARCHIVE”)
  40. Google Maps
  41. Current Web Applications
  42. Web Applications with Session Archiver