Impact of HTTP Cookie Violations in Web Archives
Sawood Alam, Michele C. Weigle, and Michael L. Nelson
Old Dominion University, Norfolk, VA, USA
@ibnesayeed @WebSciDL
Supported by NSF Grant IIS-1526700
WADL '19, June 6, 2019, Urbana-Champaign, Illinois
Cookies Are Why Your Archived Twitter Page Is Not in English
https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html
All Your Tweets Are Belong To Kannada
9,000+ mementos of @BarackObama
English: 53%
Kannada: 22%
Other 45 languages: 25%
https://blog.dshr.org/2018/04/all-your-tweets-are-belong-to-kannada.html
Is JavaScript Causing This?
Twitter seems to be rendering translated phrases on the server.
So, JavaScript cannot be responsible.
Is Cache Conflicting at a Shared Proxy?
Twitter goes to great lengths (sometimes in wrong ways) to ensure that its pages are not cached.
Is On-demand Archiving Bringing User Preferences In?
IA replays users’ headers in Save Page Now, but other archives do not have on-demand archiving.
Archive.is sends a custom Accept-Language header, not the one a user’s browser sends to it.
Is Geo-location Affecting It?
Most archival crawlers run in the USA or Europe, which does not explain why Kannada (a regional Indian language) is so popular.
Is Heritrix Sending Wrong Accept-Language Headers?
Heritrix-generated WARC files do not contain any Accept-Language header.
Language Content Negotiation in Twitter
The “?lang=<lang-code>” query parameter has the highest precedence.
Twitter honors the Accept-Language header for content negotiation, but does not advertise it in a Vary header.
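The precedence described above can be sketched as a small resolver. This is only an illustration, not Twitter's actual logic: the slides confirm that the query parameter wins and that both a "lang" cookie and the Accept-Language header can influence the result, but the relative order of cookie and header, the `SUPPORTED` set, the `DEFAULT_LANG` fallback, and the function name `pick_language` are assumptions made for this sketch. Quality values (q=...) in Accept-Language are ignored for brevity.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit, parse_qs

# Hypothetical subset of language codes, for illustration only.
SUPPORTED = {"en", "ar", "kn", "pt", "ur"}
DEFAULT_LANG = "en"  # assumed fallback


def pick_language(url: str,
                  cookies: Mapping[str, str],
                  accept_language: Optional[str]) -> str:
    """Resolve the response language for a request."""
    # 1. An explicit ?lang=<lang-code> parameter has the highest precedence.
    lang = parse_qs(urlsplit(url).query).get("lang", [None])[0]
    if lang in SUPPORTED:
        return lang
    # 2. Assumed next: a "lang" cookie left behind by an earlier response.
    lang = cookies.get("lang")
    if lang in SUPPORTED:
        return lang
    # 3. Assumed last: first supported tag in Accept-Language, in listed order.
    if accept_language:
        for part in accept_language.split(","):
            primary = part.split(";")[0].strip().lower().split("-")[0]
            if primary in SUPPORTED:
                return primary
    return DEFAULT_LANG
```

Under these assumptions, a request for a profile page that carries a stale "lang=kn" cookie resolves to Kannada even when the client would prefer English, which matches the symptom observed in the archives.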
Alternate Language Links Pollute Crawler’s Frontier Queue
Because Kannada (kn) is at the end of the list of alternate language links, its “lang” cookie sticks around for a long time, affecting many subsequent Twitter URLs.
Experiment With Heritrix On Two Seed URIs
● https://twitter.com/?lang=ar
○ First request has an explicit lang query parameter
○ First response has a “Set-Cookie: lang=ar” header
● https://twitter.com/phonedude_mln/
○ Second request has no lang query parameter, but sends a “Cookie: lang=ar”
○ Second response returns the page in Arabic
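The two-seed experiment can be mimicked with a toy simulation. This is a sketch under stated assumptions, not Heritrix code: `fake_server` is a hypothetical stand-in for the origin server that sets a "lang" cookie whenever the query parameter is present, and `crawl` is a hypothetical crawler loop that shares one cookie jar across all seeds.

```python
from urllib.parse import urlsplit, parse_qs


def fake_server(url, cookies):
    """Toy origin server: returns (response_headers, page_language)."""
    lang = parse_qs(urlsplit(url).query).get("lang", [None])[0]
    if lang:
        # An explicit lang parameter is honored and remembered via a cookie.
        return {"Set-Cookie": f"lang={lang}"}, lang
    # Without the parameter, the cookie (if any) decides; "en" is assumed default.
    return {}, cookies.get("lang", "en")


def crawl(seeds):
    """Fetch seeds in order with ONE cookie jar shared by the whole crawl."""
    jar = {}
    log = []
    for url in seeds:
        sent = dict(jar)  # cookies sent with this request
        headers, page_lang = fake_server(url, sent)
        if "Set-Cookie" in headers:
            name, _, value = headers["Set-Cookie"].partition("=")
            jar[name] = value
        log.append((url, sent, page_lang))
    return log


log = crawl(["https://twitter.com/?lang=ar",
             "https://twitter.com/phonedude_mln/"])
for url, sent, page_lang in log:
    print(url, sent, page_lang)
```

The second entry in `log` shows the request going out with `{'lang': 'ar'}` and the page coming back in Arabic, even though its URI carries no lang parameter.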
Replaying Captured WARC With PyWB
https://twitter.com/?lang=ar https://twitter.com/phonedude_mln/
Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages
https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html
Defaced Composite Mementos That Never Existed on the Live Web
● Live leakage (Zombies)
● Temporal Violations
● Origin Violations
● And now, Cookie Violations!
https://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
Anatomy of a Twitter Timeline
● Page is loaded with the initial set of tweets
● Navigation bar is in the current language
● Some sidebar blocks are loaded lazily
● New tweets are polled every 30 seconds
● Global trends are polled every 5 minutes
Twitter Returns Server-side Rendered Markup
Cookies set by prior responses may impact subsequent XHR responses.
Pages With Explicit lang Parameter Are Consistent
?lang=pt
?lang=en
?lang=ur
Mementos with an explicit “lang” parameter are language consistent.
Replicate Heritrix Behavior on the Live Web
● Load https://twitter.com/ in browser tab B
● Retweet a tweet in tab A
● Load https://twitter.com/?lang=en in browser tab A
● Expand notification in tab B
● Change lang param in tab A
What Can We Do About These Cookie Violations?
● Crawling
○ Sandbox short crawl sessions
○ Explicitly enforce short cookie expiration time and garbage collect frequently
○ Identify such sources of cookie violations and filter them out
● Replay
○ Respect content negotiation headers (advertised in “Vary” header)
○ Identify non-advertised cookies that affect the content to incorporate in replay
○ Classify cookies into categories such as session, tracking, and configuration
Ignoring cookies in replay causes cookie violations and has privacy concerns in personal archiving.
Blindly utilizing cookies causes false positives (hurts discovery of archived resources).
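The crawl-side suggestion of enforcing short cookie expiration and collecting garbage frequently could look roughly like the sketch below. It is not part of Heritrix or any existing crawler; the class name `ShortLivedCookieJar`, its methods, and the 60-second cap are all hypothetical choices for illustration. A crawler could also create one such jar per short crawl session to approximate sandboxing.

```python
import time


class ShortLivedCookieJar:
    """Cookie store that caps every cookie's lifetime at max_age_seconds."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age = max_age_seconds
        self.clock = clock          # injectable clock, handy for testing
        self._cookies = {}          # name -> (value, expires_at)

    def set(self, name, value, server_max_age=None):
        # Never keep a cookie longer than the crawler's own cap,
        # no matter how long the server asks for.
        if server_max_age is None:
            ttl = self.max_age
        else:
            ttl = min(server_max_age, self.max_age)
        self._cookies[name] = (value, self.clock() + ttl)

    def collect_garbage(self):
        # Drop every cookie whose (capped) lifetime has elapsed.
        now = self.clock()
        self._cookies = {n: (v, exp)
                         for n, (v, exp) in self._cookies.items()
                         if exp > now}

    def header(self):
        """Value for the Cookie request header, after expiring stale entries."""
        self.collect_garbage()
        return "; ".join(f"{n}={v}" for n, (v, _) in self._cookies.items())


# Usage with a fake clock so the example is deterministic.
now = [0.0]
jar = ShortLivedCookieJar(60, clock=lambda: now[0])
jar.set("lang", "kn", server_max_age=86400)  # server asks for a full day
print(jar.header())  # lang=kn
now[0] = 61.0        # just past the 60-second cap
print(repr(jar.header()))  # ''
```

With a cap like this, a "lang" cookie picked up from one alternate-language link would stop influencing requests made more than a minute later, at the cost of possibly dropping cookies that a site legitimately needs for longer sessions.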
Conclusions
● Cookies Are Why Your Archived Twitter Page Is Not in English
○ https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html
● Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages
○ https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html
● Identified yet another source of bias in archives (overrepresented languages)
● Described behavior of cookies in crawling and replay (cookie violations)
● Proposed some potential solutions like keeping cookies short-lived
● Described open problems that need more in-depth research