2. www.bl.uk 2
Context
• Three collections:
– Selective since 2004
– Legal Deposit since 2013
– Historical, 1996-2013, from the Internet Archive (IA)
• Iterative Development:
– Work directly with researchers
– Today’s historical research tools provide tomorrow’s reading rooms
• Using Solr to support:
– Discovery
– Preservation
– Analytics
Discovery
• Web archives tend to be messy
– Lots of poor quality content, e.g. from crawler traps.
– Spam, e.g. link spam from link farms.
– Utility of PageRank over time is unclear
• Faceted search
– Invest in developing facets that let users filter results, rather than relying on PageRank or boosts to rank them.
– e.g. basic facets from embedded metadata:
• Last-Modified, Author, etc.
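As a rough illustration of the facet-first approach, the sketch below builds a Solr select URL that asks for facet counts and applies a filter query instead of tuning ranking. The `facet`, `facet.field`, `facet.mincount` and `fq` parameters are standard Solr query parameters; the host, core name (`webarchive`) and field names (`author`, `last_modified_year`) are assumptions made for this example, not the schema used in the talk.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, facet_fields, filters=None):
    """Build a Solr /select URL that requests facet counts and applies filters."""
    params = [
        ("q", text),
        ("rows", "10"),
        ("facet", "true"),
        ("facet.mincount", "1"),  # hide facet values with no matching documents
    ]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    # Each selected facet value becomes a filter query (fq), not a ranking boost.
    params += [("fq", f'{field}:"{value}"') for field, value in (filters or {}).items()]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr/webarchive",
    "olympics",
    ["author", "last_modified_year"],
    {"last_modified_year": "2012"},
)
print(url)
```

Filters narrow the result set without changing how the remaining documents are scored, which is the point of favouring facets over boosts.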
Discovery: Text features
• No stemming or lemmatization
– Researchers hated it
• Natural language detection
– e.g. combine facets: gov.uk + fr (French-language pages on gov.uk)
• Postcode-based geoindex
• Sentiment analysis
• Similarity hashing via ssdeep
– To detect similar texts
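To make the postcode-based geoindex idea concrete, here is a minimal sketch that pulls UK-style postcodes out of page text so they could later be mapped to coordinates. The regular expression is a simplified approximation of the postcode format and will produce some false positives; it is an assumption for illustration, not the pattern used by the indexer, and the lookup from postcode to location is left out.

```python
import re

# Simplified UK postcode shape: outward code (e.g. "NW1", "SW1A") + inward code (e.g. "2DB").
# This is an approximation; it does not validate against the real list of postcode areas.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][0-9A-Z]?)\s*([0-9][A-Z]{2})\b")


def extract_postcodes(text):
    """Return candidate postcodes found in text, normalised to 'OUTWARD INWARD'."""
    return [f"{outward} {inward}" for outward, inward in POSTCODE_RE.findall(text)]


sample = "The British Library, 96 Euston Road, London NW1 2DB. Also written SW1A1AA."
postcodes = extract_postcodes(sample)
print(postcodes)
```

Normalising the spacing means the same postcode written in different ways ends up as a single index value.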
Discovery: Image features
• Basic properties:
– width, height, pixel count
• Face detection
– Number of faces & location
• Dominant colour extraction
– ‘Characteristic’ colours
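One simple way to approximate "characteristic" colours is to quantise each pixel into a coarse RGB bucket and count the buckets. The sketch below does that over a plain list of `(r, g, b)` tuples; it is a stand-in technique chosen for brevity, and the talk does not say which colour-extraction method the indexer actually uses. Decoding image files into pixels is assumed to happen elsewhere.

```python
from collections import Counter


def dominant_colours(pixels, top_n=3, bits=3):
    """Return the top_n most common colours after quantising each channel to `bits` bits."""
    shift = 8 - bits
    half = (1 << shift) >> 1  # offset to report the centre of each bucket
    counts = Counter((r >> shift, g >> shift, b >> shift) for r, g, b in pixels)
    return [
        ((r << shift) + half, (g << shift) + half, (b << shift) + half)
        for (r, g, b), _ in counts.most_common(top_n)
    ]


# Toy pixel data: mostly red-ish, some blue.
pixels = [(250, 10, 10)] * 5 + [(245, 5, 12)] + [(10, 10, 250)] * 2
top = dominant_colours(pixels, top_n=2)
print(top)
```

Coarse buckets make near-identical shades count as one colour, which keeps the facet values few enough to browse.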
Preservation
• Format analysis:
– Using extended MIME types (inc. version + charset):
• Served
• Apache Tika
• DROID
– First-four-bytes
– File extension
• Examples
– Understanding Unidentified Resources
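The value of recording several format signals side by side is that disagreements become searchable. The sketch below gathers three of the signals listed above (served type, file extension, first four bytes) for a single resource; Apache Tika and DROID are omitted because they are external tools. The function name, the returned keys and the small magic-number table are assumptions for this example.

```python
import mimetypes

# A tiny, illustrative table of well-known four-byte signatures.
MAGIC = {
    b"%PDF": "application/pdf",
    b"\x89PNG": "image/png",
    b"GIF8": "image/gif",
    b"PK\x03\x04": "application/zip",
}


def format_signals(url_path, served_type, payload):
    """Collect independent format hints for one archived resource."""
    extension_type, _ = mimetypes.guess_type(url_path)
    first_four = payload[:4]
    return {
        # Strip parameters such as charset to get the bare served type.
        "served": served_type.split(";")[0].strip().lower(),
        "extension": extension_type,
        "first_four_bytes": first_four.hex(),
        "magic": MAGIC.get(first_four),  # None when the signature is unknown
    }


signals = format_signals("/docs/report.pdf", "text/html; charset=UTF-8", b"%PDF-1.4 ...")
print(signals)
```

Here the server claims `text/html` while both the extension and the leading bytes point to PDF, which is the kind of mismatch a preservation query would want to surface.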
Analytics
• Researcher Expectations
– “How big is the UK Web?”
• From Crawl To Web
– Crawl schedule, parameters, logs.
– “Files over 10MB are not archived”
– De-duplication handling critical
– Can't forget HTTP 3xx, 4xx and 5xx responses
• Compensate via normalisation strategies
– cf. Google Books Ngram
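A common normalisation strategy, in the spirit of the Google Books Ngram comparison, is to report a term's hits as a share of everything archived in the same period, so that a larger crawl does not masquerade as a trend. The sketch below shows the arithmetic; the function name is hypothetical and the counts are made-up toy numbers, not figures from the collections.

```python
def normalise_by_year(term_hits, total_docs):
    """Convert raw per-year hit counts into a fraction of all documents for that year."""
    return {
        year: term_hits.get(year, 0) / total
        for year, total in sorted(total_docs.items())
        if total > 0  # skip years with nothing archived
    }


# Toy numbers for illustration only.
term_hits = {2005: 50, 2010: 400}
total_docs = {2005: 10_000, 2010: 200_000}

share = normalise_by_year(term_hits, total_docs)
print(share)
```

In this toy case the raw count rises eightfold, yet the normalised share falls, because the later crawl is twenty times larger.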
Technical Architecture
• Core indexer can run from CLI or Hadoop
– Makes development much easier
• Hadoop indexer has two modes:
– SolrCloud:
• Performance acceptable as long as shards map to cores and there's good I/O (1 billion, 1 server, 1 week)
• Memory issues relating to query complexity
– Direct to HDFS:
• Really fast for moderate data volumes
• Slows down as shards grow
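The "1 billion, 1 server, 1 week" figure can be turned into a sustained rate with a little arithmetic. The sketch below assumes that "1 billion" counts indexed records and that the week was one of continuous running; both are interpretations of the slide rather than stated facts.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY


def sustained_rate(records, seconds):
    """Average records indexed per second over the given period."""
    return records / seconds


def days_to_index(records, rate_per_second, servers=1):
    """Days needed at a given per-server rate, assuming perfectly linear scaling."""
    return records / (rate_per_second * servers) / SECONDS_PER_DAY


rate = sustained_rate(1_000_000_000, SECONDS_PER_WEEK)
print(f"{rate:.0f} records per second")
```

The linear-scaling assumption in `days_to_index` is optimistic, given the slide's own note that indexing slows down as shards grow.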
Scale
• 1996-2010 tranche of the IA dataset:
– 2.5 Billion HTTP 200 URLs
• Performance issues:
– Data quality
– Robustness
– Configuration errors
• Currently re-indexing:
– with better duplicate handling
– on three dedicated servers
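One plausible form of duplicate handling is to collapse captures that share a URL and an identical payload into a single index document that lists every crawl date. The sketch below illustrates that idea with a SHA-1 payload digest; it is an assumption about what "better duplicate handling" could look like, not a description of the indexer's actual logic, and the record layout is hypothetical.

```python
import hashlib


def collapse_duplicates(captures):
    """Merge captures with the same URL and payload into one document.

    captures: iterable of (url, crawl_date, payload_bytes) tuples.
    """
    merged = {}
    for url, crawl_date, payload in captures:
        digest = hashlib.sha1(payload).hexdigest()
        doc = merged.setdefault(
            (url, digest),
            {"url": url, "digest": digest, "crawl_dates": []},
        )
        doc["crawl_dates"].append(crawl_date)  # keep every date the content was seen
    return list(merged.values())


docs = collapse_duplicates([
    ("http://example.org/", "2009-01-01", b"<html>same</html>"),
    ("http://example.org/", "2010-01-01", b"<html>same</html>"),
    ("http://example.org/", "2011-01-01", b"<html>changed</html>"),
])
print(len(docs), docs[0]["crawl_dates"])
```

Keeping the crawl dates on the merged document preserves the fact that the page was unchanged across crawls, while avoiding inflated counts in analytics.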
Open Collaboration
• Fully open source stack:
– webarchive-discovery indexer
– Begun developing an analytics UI
• Keen to collaborate
– This community faces a common problem:
• But not a core SolrCloud/Elasticsearch use case
– Danish SolrCloud on SSD discovered via Solr mailing list
• http://sbdevel.wordpress.com/2013/12/06/danish-webscale/