Certain HTTP Cookies on certain sites can be a source of content bias in archival crawls. Accommodating Cookies at crawl time but not utilizing them at replay time may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store Cookies with short expiration times and that archival replay systems account for values in the Vary header along with URIs.
1. Impact of HTTP Cookie Violations in Web Archives
Sawood Alam, Michele C. Weigle, and Michael L. Nelson
Old Dominion University, Norfolk, VA, USA
@ibnesayeed @WebSciDL
Supported by NSF Grant IIS-1526700
WADL '19, June 6, 2019, Urbana-Champaign, Illinois
2. @ibnesayeed
Cookies Are Why Your Archived Twitter Page Is Not in English
https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html
3. @ibnesayeed
All Your Tweets Are Belong To Kannada
9,000+ mementos of @BarackObama
English: 53%
Kannada: 22%
Other 45 languages: 25%
https://blog.dshr.org/2018/04/all-your-tweets-are-belong-to-kannada.html
4. @ibnesayeed
Is JavaScript Causing This?
Twitter seems to be rendering translated phrases on the server.
So, JavaScript cannot be responsible.
5. @ibnesayeed
Is Cache Conflicting at a Shared Proxy?
Twitter goes to great lengths (sometimes in the wrong ways) to ensure that its pages are not cached.
6. @ibnesayeed
Is On-demand Archiving Bringing User Preferences In?
IA replays users’ headers in Save Page Now, but other archives do not have on-demand archiving.
Archive.is sends a custom Accept-Language header, not the one a user’s browser sends to it.
7. @ibnesayeed
Is Geo-location Affecting It?
Most archival crawlers run in the USA or Europe, which does not explain why Kannada (a regional Indian language) is so popular.
8. @ibnesayeed
Is Heritrix Sending Wrong Accept-Language Headers?
Heritrix-generated WARC files do not contain any Accept-Language header.
9. @ibnesayeed
Language Content Negotiation in Twitter
The “?lang=<lang-code>” query parameter has the highest precedence.
Twitter honors the Accept-Language header for content negotiation, but does not advertise it in a Vary header.
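The precedence described on this slide can be sketched as a small function. This is only an illustration, not Twitter's actual logic: the function name, the supported-language subset, and the English default are hypothetical, and placing the "lang" cookie above Accept-Language is an assumption (the slides only state that the query parameter wins and that the header is honored).

```python
from urllib.parse import urlsplit, parse_qs

SUPPORTED = {"en", "ar", "kn", "pt", "ur"}  # illustrative subset, not Twitter's real list
DEFAULT_LANG = "en"                         # assumed default

def negotiate_language(url, cookies=None, accept_language=None):
    """Return (language, source) for a request, under the assumed precedence."""
    cookies = cookies or {}

    # 1. The "?lang=<lang-code>" query parameter has the highest precedence.
    lang = parse_qs(urlsplit(url).query).get("lang", [None])[0]
    if lang in SUPPORTED:
        return lang, "query"

    # 2. A previously set "lang" cookie (assumed to rank above Accept-Language).
    lang = cookies.get("lang")
    if lang in SUPPORTED:
        return lang, "cookie"

    # 3. Accept-Language, taken in listed order; q-values are ignored for brevity.
    for part in (accept_language or "").split(","):
        primary = part.split(";")[0].strip().lower().split("-")[0]
        if primary in SUPPORTED:
            return primary, "accept-language"

    return DEFAULT_LANG, "default"

print(negotiate_language("https://twitter.com/?lang=ar", {"lang": "kn"}, "en-US"))
print(negotiate_language("https://twitter.com/phonedude_mln/", {"lang": "ar"}, "en-US"))
```

Because the response carries no Vary header, an archive that stores only the URI cannot tell which of these three inputs selected the language it captured.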
10. @ibnesayeed
Alternate Language Links Pollute Crawler’s Frontier Queue
Because Kannada (kn) is at the end of the list of alternate language links, its “lang” cookie sticks around for a long time, affecting many subsequent Twitter URLs.
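A toy simulation of this effect, assuming a FIFO frontier and a single cookie jar shared across the crawl. The function name, the English default, and the order of the language codes other than "kn" coming last are hypothetical choices made for illustration.

```python
from collections import Counter, deque
from urllib.parse import urlsplit, parse_qs

def simulate_crawl(frontier):
    """Visit URLs in FIFO order with one shared cookie jar.

    Returns a list of (url, language the page would be served in).
    """
    jar = {}
    served = []
    queue = deque(frontier)
    while queue:
        url = queue.popleft()
        lang = parse_qs(urlsplit(url).query).get("lang", [None])[0]
        if lang:
            jar["lang"] = lang  # models the "Set-Cookie: lang=<code>" response header
        served.append((url, jar.get("lang", "en")))  # "en" default is an assumption
    return served

# Alternate language links discovered on one page, with Kannada last.
alternates = [f"https://twitter.com/?lang={code}" for code in ["ar", "pt", "ur", "en", "kn"]]
profiles = [f"https://twitter.com/user{i}" for i in range(3)]  # hypothetical profile URLs

served = simulate_crawl(alternates + profiles)
print(Counter(lang for _, lang in served))
```

In this sketch every profile page queued after the alternate links is served in Kannada, because the last "lang" cookie written is the one that persists.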
11. @ibnesayeed
Experiment With Heritrix On Two Seed URIs
● https://twitter.com/?lang=ar
○ First request has an explicit lang query parameter
○ First response has a “Set-Cookie: lang=ar” header
● https://twitter.com/phonedude_mln/
○ Second request has no lang query parameter, but sends a “Cookie: lang=ar” header
○ Second response returns the page in Arabic
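The header exchange in this experiment can be replayed offline with Python's standard `http.cookies` module. The `toy_server` function below is a hypothetical stand-in for twitter.com that implements only the behavior listed above, and `crawl` is a simplified model of a crawler that keeps one cookie jar for all seeds; neither is Heritrix code.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlsplit, parse_qs

def toy_server(url, cookie_header=""):
    """Hypothetical server: returns (language served, response headers)."""
    query = parse_qs(urlsplit(url).query)
    if "lang" in query:
        lang = query["lang"][0]
        return lang, {"Set-Cookie": f"lang={lang}; Path=/"}
    jar = SimpleCookie(cookie_header)
    lang = jar["lang"].value if "lang" in jar else "en"  # "en" default is assumed
    return lang, {}

def crawl(seeds):
    """Fetch seeds in order, carrying cookies from one response to the next request."""
    jar = SimpleCookie()
    log = []
    for url in seeds:
        cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
        lang, response_headers = toy_server(url, cookie_header)
        if "Set-Cookie" in response_headers:
            jar.load(response_headers["Set-Cookie"])
        log.append((url, cookie_header, lang))
    return log

for url, sent_cookie, lang in crawl([
    "https://twitter.com/?lang=ar",
    "https://twitter.com/phonedude_mln/",
]):
    print(f"{url}  Cookie: {sent_cookie!r}  served in: {lang}")
```

The second request carries no lang parameter, yet the sketch serves it in Arabic because the jar still holds the cookie set by the first response.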
13. @ibnesayeed
Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages
https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html
14. @ibnesayeed
Defaced Composite Mementos That Never Existed on the Live Web
● Live leakage (Zombies)
● Temporal Violations
● Origin Violations
And now, Cookie Violations!
https://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
15. @ibnesayeed
Anatomy of a Twitter Timeline
● Page is loaded with the initial set of tweets
● Navigation bar is in the current language
● Some sidebar blocks are loaded lazily
● New tweets are polled every 30 seconds
● Global trends are polled every 5 minutes
17. @ibnesayeed
Pages With Explicit lang Parameter Are Consistent
?lang=pt
?lang=en
?lang=ur
Mementos with an explicit “lang” parameter are language-consistent.
18. @ibnesayeed
Replicate Heritrix Behavior on the Live Web
● Load https://twitter.com/ in a browser tab B
● Retweet a tweet in the tab A
● Load https://twitter.com/?lang=en in a browser tab A
● Expand notification in the tab B
● Change lang param in the tab A
19. @ibnesayeed
What Can We Do About These Cookie Violations?
● Crawling
○ Sandbox short crawl sessions
○ Explicitly enforce short cookie expiration time and garbage collect frequently
○ Identify such sources of cookie violations and filter them out
● Replay
○ Respect content negotiation headers (advertised in the “Vary” header)
○ Identify non-advertised cookies that affect the content to incorporate in replay
○ Classify cookies into categories such as session, tracking, and configuration
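One way to picture the crawl-side suggestion of short cookie expiration with frequent garbage collection is a cookie jar that caps every cookie's lifetime. This is a minimal sketch under stated assumptions: the class name, the 60-second cap, and the injectable clock are all hypothetical and do not correspond to any Heritrix setting.

```python
import time

class ShortLivedCookieJar:
    """Toy cookie jar that caps cookie lifetimes and discards expired cookies."""

    def __init__(self, max_age_seconds=60.0, clock=time.monotonic):
        self.max_age = max_age_seconds  # crawler-enforced cap (assumed value)
        self.clock = clock              # injectable so the example is deterministic
        self._store = {}                # name -> (value, expires_at)

    def set(self, name, value, requested_max_age=None):
        # Honor the site's requested lifetime only when it is shorter than the cap.
        if requested_max_age is None:
            lifetime = self.max_age
        else:
            lifetime = min(requested_max_age, self.max_age)
        self._store[name] = (value, self.clock() + lifetime)

    def collect_garbage(self):
        """Drop expired cookies and return their names."""
        now = self.clock()
        expired = [name for name, (_, expires_at) in self._store.items() if expires_at <= now]
        for name in expired:
            del self._store[name]
        return expired

    def cookie_header(self):
        """Value for the Cookie request header, after garbage collection."""
        self.collect_garbage()
        return "; ".join(f"{name}={value}" for name, (value, _) in self._store.items())

now = [0.0]  # fake clock so the demonstration does not need to sleep
jar = ShortLivedCookieJar(max_age_seconds=60, clock=lambda: now[0])
jar.set("lang", "kn", requested_max_age=10 * 365 * 24 * 3600)  # site asks for ten years
print(repr(jar.cookie_header()))  # 'lang=kn'
now[0] = 61.0
print(repr(jar.cookie_header()))  # ''
```

With a cap like this, a "lang" cookie picked up from one alternate-language link would stop influencing requests made more than a minute later.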
Ignoring cookies in replay causes cookie violations and has privacy concerns in personal archiving.
Blindly utilizing cookies causes false positives (hurts discovery of archived resources).
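The trade-off in these two statements can be sketched as a replay index keyed on the URI plus the request header values named in a Vary list, with a fallback so that a header mismatch does not hide an archived resource. The names `variant_key` and `ReplayIndex` are hypothetical, and treating "Cookie" as a Vary value here is an assumption an archive would have to infer, since Twitter does not advertise it. Matching on the whole Cookie header is deliberately crude.

```python
def variant_key(vary, request_headers):
    """Build a key from the request header values named in a Vary list."""
    names = sorted(h.strip().lower() for h in vary.split(",") if h.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    return tuple((name, lowered.get(name, "")) for name in names)

class ReplayIndex:
    """Toy index of captures keyed by URI and Vary-selected request headers."""

    def __init__(self):
        self._captures = {}  # uri -> list of (vary, key, body)

    def add(self, uri, vary, request_headers, body):
        key = variant_key(vary, request_headers)
        self._captures.setdefault(uri, []).append((vary, key, body))

    def lookup(self, uri, request_headers):
        """Return (body, exact_match). Falls back to any capture of the URI."""
        candidates = self._captures.get(uri, [])
        for vary, key, body in candidates:
            if variant_key(vary, request_headers) == key:
                return body, True
        if candidates:
            return candidates[0][2], False  # avoid a false "not archived"
        return None, False

index = ReplayIndex()
uri = "https://twitter.com/phonedude_mln/"
index.add(uri, "Cookie", {"Cookie": "lang=ar"}, "page in Arabic")
index.add(uri, "Cookie", {"Cookie": "lang=en"}, "page in English")

print(index.lookup(uri, {"cookie": "lang=en"}))  # exact variant
print(index.lookup(uri, {"Cookie": "lang=kn"}))  # no Kannada capture, falls back
```

Returning the `exact_match` flag lets a replay system decide whether to serve the fallback or warn that the composite page may mix variants.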
20. @ibnesayeed
Conclusions
● Cookies Are Why Your Archived Twitter Page Is Not in English
○ https://ws-dl.blogspot.com/2018/03/2018-03-21-cookies-are-why-your.html
● Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages
○ https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html
● Identified yet another source of bias in archives (overrepresented languages)
● Described behavior of cookies in crawling and replay (cookie violations)
● Proposed some potential solutions like keeping cookies short-lived
● Described open problems that need more in-depth research