1. Designing Preservable Websites
Nicholas Taylor
@nullhandle
DC, VA & MD Search Engine Marketing Meetup
July 18, 2012
“found glass” by Flickr user nuanc under CC BY-NC-ND 2.0
4. search engine crawler ≠ archival crawler
“GoogleBots” by Flickr user ares64 under CC BY 2.0
5. what is a “preservable” website?
“Fish Preserver” by Flickr user ecstaticist under CC BY-NC-SA 2.0
6. three priorities:
• capture: can resources be acquired by current web archiving technologies?
• replay: can the user’s experience of the original website be recreated from the archived resources?
• preservation: how can it be assured that the archived website remains coherent over time?
7. follow web standards and
accessibility guidelines
“Web Standards Fortune Cookie” by Flickr user mherzber under CC BY-SA 2.0
15. three tips
1. see how well your site validates on http://validator.w3.org/
2. see how your site looks on http://archive.org/
3. your favorite online sitemap generator is a good starting point (see the sitemap sketch below)
“Highlighters” by Flickr user KJGarbutt under CC BY-ND 2.0
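As a minimal sketch of tip 3, a sitemap following the sitemaps.org protocol might look like this (the URLs and date are placeholders, not from the deck):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- one <url> entry per page you want crawlers, archival ones included, to find -->
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2012-07-18</lastmod>
      </url>
      <url>
        <loc>http://www.example.com/about.html</loc>
      </url>
    </urlset>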
Design decisions have a major effect on website preservability.
“Benign neglect” may have been sufficient for physical objects; digital ones need more active intervention.
Design and usage inform each other; where does web archiving fit?
Because web archivists care about recreating the user experience, they care about all assets being crawled.
This is also good for usability and SEO. Web crawlers access sites much as a text browser does, and the replay platform must accommodate coding idiosyncrasies.
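One quick way to approximate the crawler’s view is to fetch the raw HTML, with no script execution; a sketch using curl (the URL is a placeholder):

    # Retrieve the page exactly as a crawler receives it; content that only
    # appears after scripts run may never be captured or replayed.
    curl -s http://www.example.com/ | less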
CSS and JavaScript directories matter for archiving but perhaps not for search engine indexing.
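For example, a robots.txt that walls off asset directories also keeps archival crawlers from capturing a page’s presentation; a sketch (the directory names are assumptions):

    User-agent: *
    # Blocking these may not hurt search indexing, but an archival crawler
    # denied the stylesheets and scripts cannot faithfully replay the site:
    # Disallow: /css/
    # Disallow: /js/
    Disallow: /admin/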
A crawler can only capture links it sees, and a user of an archived site can only navigate by following links. Avoid relying on Flash, JavaScript, or other technologies that obscure links; use a site map.
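A hedged illustration of the difference (markup invented for this example):

    <!-- capturable: the destination URL is visible in the HTML itself -->
    <a href="/articles/preservable-websites.html">Designing Preservable Websites</a>

    <!-- fragile: the destination exists only inside script logic, so an
         archival crawler may never discover the target page -->
    <a href="#" onclick="loadPage('preservable-websites'); return false;">Designing Preservable Websites</a>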
Link rot is common, and web archiving tools are URL-sensitive. Stable URLs, with redirects where URLs must change, make for seamless archive access.
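Where a URL does have to change, a permanent redirect keeps old links and existing archive snapshots pointing somewhere useful; a sketch in Apache .htaccess form (the paths are assumptions):

    # A 301 tells crawlers, archival ones included, where the resource now lives
    Redirect permanent /old-path/page.html http://www.example.com/new-path/page.html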
Copyright law lacks explicit provisions for digital preservation. Many libraries ask for permission to archive websites. Creative Commons provides affirmative permission to be crawled and preserved.
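One common way to express that permission in the page itself is a machine-readable license link; a sketch (the specific license is a placeholder):

    <!-- rel="license" lets crawlers and tools discover the page's license -->
    <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">
      This work is licensed under a Creative Commons Attribution 3.0 License.
    </a>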
Websites contain many different file types, each with distinct preservation risks. Favor open standards and file formats, except those that are poorly documented or that permit vendor-specific extensions.
Embedded metadata makes it easier to replay and preserve archived sites.
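A sketch of what that might look like in a page head, using Dublin Core meta tags as one possibility (the values are invented for this example):

    <head>
      <!-- descriptive metadata travels with the page into the archive -->
      <meta name="DC.title" content="Designing Preservable Websites" />
      <meta name="DC.creator" content="Nicholas Taylor" />
      <meta name="DC.date" content="2012-07-18" />
    </head>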
Platform providers are more likely to accommodate commercial search indexers than archival crawlers. If you care about archiving, inquire about policies, examine robots.txt, or look at how the site appears in the Internet Archive’s Wayback Machine. If you’re using an open source CMS, be sure to review the bundled robots.txt.
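Two hedged ways to check a site’s presence in the Wayback Machine (example.com is a placeholder, and the availability endpoint is assumed from the Wayback Machine’s public API):

    # Browse the index of captures for a site:
    #   http://web.archive.org/web/*/http://www.example.com/
    # Or query the availability API for the closest snapshot as JSON:
    curl "http://archive.org/wayback/available?url=www.example.com"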
While following these recommendations won’t guarantee perfect archiving, not following them will ensure additional challenges.