IIPC GA 2014 Solr

Andy Jackson

Large-Scale Web Archive
Discovery & Analytics Using
Apache Solr
Andrew Jackson
UK Web Archive Technical Lead

www.bl.uk 2
Context
• Three collections:
– Selective since 2004
– Legal Deposit since 2013
– Historical 1996-2013 from IA
• Iterative Development:
– Work directly with researchers
– Today’s historical research tools
provide tomorrow’s reading rooms
• Using Solr to support:
– Discovery
– Preservation
– Analytics

Discovery
• Web archives tend to be messy
– Lots of poor quality content, e.g. from crawler traps.
– Spam, e.g. link spam from link farms.
– Utility of PageRank over time is unclear
• Faceted search
– Invest in developing facets to allow filtering rather than
PageRank or boosts to rank results.
– e.g. basic facets from embedded metadata:
• Last-Modified, Author, etc.
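
As a rough illustration of the facet-first approach above, the sketch below builds a Solr select URL that asks for facet counts and narrows results with a filter query. The `q`, `fq`, `facet`, `facet.field` and `facet.mincount` parameters are standard Solr request parameters; the core name `archive`, the field names `author` and `last_modified_year`, and the helper `facet_query_url` are assumptions made for this example, not names taken from the talk.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, text, facet_fields, filters=None):
    """Build a Solr /select URL that returns facet counts for the given fields.

    Field names passed in here are hypothetical; use the ones in your schema.
    """
    params = [
        ("q", text),
        ("rows", "10"),
        ("facet", "true"),
        ("facet.mincount", "1"),  # hide facet values with no matching documents
    ]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    # Each filter becomes an fq clause, which narrows results without affecting ranking.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base_url}/select?{urlencode(params)}"


url = facet_query_url(
    "http://localhost:8983/solr/archive",  # assumed local core, for illustration only
    "olympics",
    ["author", "last_modified_year"],
    {"author": "BBC"},
)
print(url)
```

Filtering through `fq` rather than boosting keeps the ranking simple and lets the researcher decide which slices of a messy collection to exclude.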

Discovery: HTML Links
(also)

Discovery: Embedded Licenses

Discovery: Text features
• No stemming or lemmatization
– Researchers hated it
• Natural language detection
– e.g. filter by domain gov.uk + language fr
• Postcode-based geoindex
• Sentiment analysis
• Similarity hashing via ssdeep
– To detect similar texts
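
One way to picture the postcode-based geoindex listed above is a text pass that pulls out UK postcodes, which could then be mapped to coordinates. The sketch below covers only the extraction step, using a deliberately simplified pattern that will miss some valid postcodes and accept some invalid ones. The function names are hypothetical, and the indexer's actual approach may differ.

```python
import re

# Simplified pattern (an assumption, not the full postcode specification):
# outward code such as "NW1" or "SW1A", optional whitespace, inward code such as "2DB".
POSTCODE = re.compile(r"\b([A-Z]{1,2}[0-9][0-9A-Z]?)\s*([0-9][A-Z]{2})\b")


def extract_postcodes(text):
    """Return distinct postcodes in order of first appearance, normalised to 'OUT IN'."""
    seen = []
    for outward, inward in POSTCODE.findall(text.upper()):
        code = f"{outward} {inward}"
        if code not in seen:
            seen.append(code)
    return seen


def outward_codes(text):
    """Return the sorted set of outward codes, a coarser key that suits faceting."""
    return sorted({code.split()[0] for code in extract_postcodes(text)})


sample = "The British Library, 96 Euston Road, London NW1 2DB. Also nw1 2db and SW1A 1AA."
print(extract_postcodes(sample))
print(outward_codes(sample))
```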

Discovery: Image features
• Basic properties:
– width, height, pixel count
• Face detection
– Number of faces & location
• Dominant colour extraction
– ‘Characteristic’ colours
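
The slide does not say how the dominant colours are extracted. A minimal sketch of one common approach, assuming the image has already been decoded into a list of (r, g, b) tuples, is to quantise each channel into a few coarse levels and count which buckets occur most often. The function name and the `levels` and `top` parameters are illustrative choices.

```python
from collections import Counter


def dominant_colours(pixels, levels=4, top=3):
    """Return the centres of the most common coarse colour buckets.

    pixels: iterable of (r, g, b) tuples with channel values 0-255.
    levels: number of buckets per channel (assumed to divide 256 evenly).
    """
    step = 256 // levels
    # Map each pixel to its bucket and count how many pixels land in each.
    buckets = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    # Report each winning bucket by the colour at its centre.
    return [
        tuple(channel * step + step // 2 for channel in bucket)
        for bucket, _count in buckets.most_common(top)
    ]


# Tiny synthetic "image": mostly red, some blue, one grey pixel.
pixels = [(250, 10, 10)] * 5 + [(10, 10, 250)] * 3 + [(128, 128, 128)]
print(dominant_colours(pixels, top=2))
```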

Preservation
• Format analysis:
– Using extended MIME types (inc. version + charset):
• Served
• Apache Tika
• DROID
– First-four-bytes
– File extension
• Examples
– Understanding Unidentified Resources
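
To make the format-analysis signals above concrete, the sketch below gathers the three that need no external tool: the served Content-Type, the first four bytes of the payload as hex, and the file extension taken from the URL path. The Apache Tika and DROID identifications are deliberately left out, and the function name and dictionary keys are assumptions for this example.

```python
import os
from urllib.parse import urlparse


def format_signals(url, served_content_type, payload):
    """Collect cheap format hints for one archived resource.

    Tika and DROID results would sit alongside these in a real index;
    they are omitted here because they need external tools.
    """
    path = urlparse(url).path  # drops the query string before looking for an extension
    extension = os.path.splitext(path)[1].lower().lstrip(".")
    return {
        "served": served_content_type.strip().lower(),
        "first_four_bytes": payload[:4].hex(),
        "extension": extension,
    }


signals = format_signals(
    "http://example.org/docs/Report.PDF?x=1",
    "Application/PDF; charset=binary",
    b"%PDF-1.4\n",
)
print(signals)
```

Faceting on the first four bytes and the extension gives a way to group resources that the identification tools leave unidentified.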

Preservation
• Deeper characterisation
– Software identifiers
– (X)HTML: Elements Used
– XML: Root Namespace
– PDF: Apache Preflight
– Apache Tika's parse errors
– Will consider adding:
• DRMLint (SCAPE)
• JHOVE
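
As a sketch of the "(X)HTML: Elements Used" characterisation above, the example below records the distinct element names that occur in a page, using Python's standard `html.parser`. The class and function names are hypothetical, and the real indexer may use a different parser.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects the distinct element names seen while parsing."""

    def __init__(self):
        super().__init__()
        self.elements = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so "FONT" and "font" count as one element.
        # Self-closing tags such as <br/> also reach this method by default.
        self.elements.add(tag)


def elements_used(html):
    """Return the sorted list of element names used in an HTML string."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.elements)


page = "<HTML><body><p>Hi<br/><FONT size=2>x</font></p></body></html>"
print(elements_used(page))
```

Indexing this list as a multi-valued field would allow element usage to be charted over crawl years.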

Analytics
• Researcher Expectations
– “How big is the UK Web?”
• From Crawl To Web
– Crawl schedule, parameters, logs.
– “Files over 10 MB are not archived”
– De-duplication handling critical
– Can't forget HTTP 3xx, 4xx, 5xx responses
• Compensate via normalisation strategies
– cf. Google Books Ngram
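
A minimal sketch of one normalisation strategy, in the spirit of the Ngram comparison above: divide the number of matching resources in each year by the total number of resources crawled that year, so that a larger crawl does not look like a trend. The function name and the sample figures are invented for illustration.

```python
def normalise_by_year(matches_per_year, totals_per_year):
    """Convert raw match counts into the fraction of each year's crawl.

    Years with an empty crawl are skipped rather than reported as zero,
    because no crawl means no evidence either way.
    """
    return {
        year: matches_per_year.get(year, 0) / total
        for year, total in sorted(totals_per_year.items())
        if total > 0
    }


# Invented numbers: raw matches rise six-fold, but the crawl grew ten-fold.
matches = {2004: 50, 2005: 300}
totals = {2004: 1000, 2005: 10000, 2006: 0}
print(normalise_by_year(matches, totals))
```

In this made-up case the raw count rises from 50 to 300 while the normalised share falls from 5% to 3%, which is the kind of crawl artefact the slide warns about.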

Technical Architecture
• Core indexer can run from CLI or Hadoop
– Makes development much easier
• Hadoop indexer has two modes:
– SolrCloud:
• Performance acceptable as long as shards map to cores
and there's good I/O (1 billion, 1 server, 1 week)
• Memory issues relating to query complexity
– Direct to HDFS:
• Really fast for moderate data volumes
• Slows down as shards grow
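
The "1 billion, 1 server, 1 week" figure above can be turned into a sustained indexing rate with simple arithmetic. The sketch assumes the billion refers to indexed records, which the slide does not state, and the function name is hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def required_rate(records, weeks=1.0, servers=1):
    """Average records per second each server must sustain to finish in time."""
    return records / (weeks * SECONDS_PER_WEEK * servers)


# Assumption: "1 billion" means one billion indexed records.
rate = required_rate(1_000_000_000)
print(f"{rate:.0f} records per second")
```

That works out to roughly 1,650 records per second, sustained around the clock for a week.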

Scale
• 1996-2010 tranche of the IA dataset:
– 2.5 Billion HTTP 200 URLs
• Performance issues:
– Data quality
– Robustness
– Configuration errors
• Currently re-indexing:
– with better duplicate handling
– on three dedicated servers
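
The slides do not describe how duplicates are handled. One common approach, sketched here purely as an assumption, is to key each capture on its URL plus a digest of the payload, so that repeated captures of unchanged content collapse into a single record carrying a list of crawl times. All names in the example are hypothetical.

```python
import hashlib


def payload_digest(payload):
    """SHA-1 hex digest of the payload bytes (used only as a content fingerprint)."""
    return hashlib.sha1(payload).hexdigest()


def collapse_duplicates(captures):
    """Group captures of identical content at the same URL.

    captures: iterable of (url, timestamp, payload_bytes) tuples.
    Returns {(url, digest): [timestamps in ascending order]}.
    """
    index = {}
    for url, timestamp, payload in captures:
        key = (url, payload_digest(payload))
        index.setdefault(key, []).append(timestamp)
    return {key: sorted(times) for key, times in index.items()}


captures = [
    ("http://example.org/", "20050101", b"<html>v1</html>"),
    ("http://example.org/", "20040101", b"<html>v1</html>"),  # same content, earlier crawl
    ("http://example.org/", "20060101", b"<html>v2</html>"),  # content changed
]
collapsed = collapse_duplicates(captures)
print(len(collapsed), "distinct versions")
```

Storing one record per distinct version, with the crawl dates as a multi-valued field, keeps repeat crawls from inflating counts.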

Open Collaboration
• Fully open source stack:
– webarchive-discovery indexer
– Begun developing an analytics UI
• Keen to collaborate
– This community faces a common problem:
• But not a core SolrCloud/ElasticSearch use case
– Danish SolrCloud on SSD discovered via Solr mailing list
• http://sbdevel.wordpress.com/2013/12/06/danish-webscale/

Slides with a title only:
• Slide 9: HTML Versions Over Time
• Slide 11: Elements Over Time
• Slide 12: PDF/A Validation Errors
• Slide 13: Parse Errors

Slide 18: Thank you

Editor's Notes

Indexing with Solr: issues and best practices
