Deep Search or Harvest: how we decided 
Simon Moffatt Ilaria Corda 
Discovery & Access – The British Library
www.bl.uk 
Overview 
•Focussing on operational and functional considerations 
– Harvesting data and using Deep Search. 
•Along the way… 
–Challenges of operating with a large data set 
–Demonstrate our new Web Archive search of 1.3bn web pages 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
Introductions 
•We work in a team of six supporting and developing Discovery and Access at the British Library 
•The British Library has reading rooms and storage in London and in Yorkshire. (We are based in Yorkshire) 
•We are a UK Legal Deposit library 
–We collect everything 
–It can only be accessed in our reading rooms 
•Our Ex Libris products 
–Aleph v20, Primo v4.x, SFX v4.x
–All locally hosted 
Deep Search or Harvest: how we decided 
•Like most institutions, the remit of our search is expanding 
–New digital content 
–Migration from older technologies 
•If you have a large number of new records to add, often you have two options: 
–Harvest the records into your main index 
–Deep Search to an external index 
Harvest 
Deep Search 
Two case studies 
Recently we faced this Harvest vs Deep Search dilemma for two new sources of data 
•A replacement for our Content Management System 
–we decided to harvest 
•A search of our new Web Archive 
– we implemented a deep search 
Content Management System - Harvested 
The technology behind the BL website is being updated. 
•There are around 50,000 pages 
•It has its own index, updated daily 
Work involved 
•Creating Normalisation rules and Pipe 
•Setting up daily processes 
Not appropriate for deep search (more on this later) 
Large datasets in Primo: Our experience 
Background: Our data 
•13m Aleph records (Books, Journals, Maps etc) 
•5m SirsiDynix Symphony records (Sound Archive) 
•48m Journal articles (growing by 2m per year) 
•1m other records from five other pipes 
Total: 67 Million records 
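The total can be checked, and roughly projected forward, with a few lines of arithmetic. This is a minimal sketch: the counts come from the slide, while the function names and the assumption that only journal articles grow (at a constant 2m per year) are illustrative.

```python
# Record counts in millions, taken from the slide above.
SOURCES = {
    "Aleph (books, journals, maps)": 13,
    "SirsiDynix Symphony (Sound Archive)": 5,
    "Journal articles": 48,
    "Other pipes": 1,
}
ARTICLE_GROWTH_PER_YEAR = 2  # millions per year, from the slide


def total_records(sources):
    """Sum the record counts (in millions) across all sources."""
    return sum(sources.values())


def projected_total(sources, years, growth=ARTICLE_GROWTH_PER_YEAR):
    """Assumption: only journal articles grow, at a constant yearly rate."""
    return total_records(sources) + growth * years


print(total_records(SOURCES))       # 67
print(projected_total(SOURCES, 5))  # 77
```

Under that assumption the main index would pass 75 million records within five years, before any new source is harvested.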
Our topology 
[Diagram. Components shown:]
•FE1 to FE3: three Front End Servers
•SE1 to SE8: eight Search Slice servers
•BO: Back Office Server
•DB: Database Server
Other labels on the diagram: Load-balanced requests; Search Results; Indexes; Config, PNXs etc; Config, Web 2.0 etc; Harvests; Public; Private
Service challenges with 67m records 
Our index is ~100GB 
•Indexing takes at least 3 hours; Hotswap takes 4 hours 
–Even if there is only one new record 
–Overnight schedules are tight 
–Fast data updates are impossible 
•System restart takes 5 hours 
–Re-sync Search Schema is a whole day 
–Failover system must be available at all times 
•Primo Service Packs and Hotfixes need caution 
•Standard documented advice must be read carefully 
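To see why overnight schedules are tight, the two durations above can be added to a harvest finish time. This is a rough sketch only: it assumes the hotswap starts once indexing has finished, and the 22:00 finish time and the function name are hypothetical.

```python
from datetime import datetime, timedelta

INDEXING = timedelta(hours=3)  # "at least 3 hours" on the slide
HOTSWAP = timedelta(hours=4)   # from the slide


def earliest_searchable(harvest_finished: datetime) -> datetime:
    """Earliest time a newly harvested record could appear in search.

    Assumption: indexing and hotswap run one after the other.
    """
    return harvest_finished + INDEXING + HOTSWAP


# Hypothetical example: the nightly harvest finishes at 22:00.
finished = datetime(2014, 9, 15, 22, 0)
print(earliest_searchable(finished))  # 2014-09-16 05:00:00
```

Even a single new record therefore takes at least seven hours to reach users, which is why fast data updates are not possible with this setup.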
Development challenges with 67m records 
•Full re-harvest takes 7 weeks 
–Major normalisation changes only 3 times per year 
–Smaller UI changes can be made more often 
•Primo Version upgrades affect 13 servers (or 56 in total) 
•Implementing Primo enhancements 
–We must consider index size 
–Sort, browse etc all have an impact 
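The re-harvest figures can be turned into a yearly budget. A minimal sketch, assuming each major normalisation change needs one full re-harvest and that re-harvests do not overlap (neither is stated on the slide):

```python
WEEKS_PER_YEAR = 52
REHARVEST_WEEKS = 7         # from the slide
MAJOR_CHANGES_PER_YEAR = 3  # from the slide

# Assumption: one full re-harvest per major change, run back to back.
weeks_reharvesting = REHARVEST_WEEKS * MAJOR_CHANGES_PER_YEAR
share_of_year = weeks_reharvesting / WEEKS_PER_YEAR

print(weeks_reharvesting)       # 21
print(f"{share_of_year:.0%}")   # 40%
```

On those assumptions, roughly two fifths of the year is spent with a full re-harvest in progress, which leaves little room for a fourth major change.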
Our development cycle (not to scale) 
But there are compensations… 
•Speed of the index 
•Control over the data 
•Consistency of rules 
•Common development skills 
•Only a single point of failure 
These are all important 
Web Archive - Deep Search 
Web Archiving collects, makes accessible and preserves web resources of scholarly and cultural importance from the UK domain 
Some Figures: 
•~1.3 billion documents 
•Regular crawling processes, e.g. BBC News 
And the infrastructure? 
•Index size is ~3.5TB, spread across 24 Solr shards 
•80-node Hadoop cluster (where crawls are processed for submission) 
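The figures above give a feel for the size of each shard, and for what a query across all of them might look like. The sketch below is illustrative only: it assumes the shards are evenly sized, uses decimal units (1TB = 1000GB), and builds a request with Solr's `shards` parameter for distributed search. The host names, the core name `webarchive` and the field `text` are hypothetical; the slides do not say how the British Library addresses its shards.

```python
from urllib.parse import urlencode

SHARD_COUNT = 24     # from the slide
INDEX_SIZE_TB = 3.5  # from the slide
DOCUMENTS = 1.3e9    # from the slide


def per_shard_averages():
    """Average size (GB) and document count per shard, assuming even shards."""
    gb_per_shard = INDEX_SIZE_TB * 1000 / SHARD_COUNT
    docs_per_shard = DOCUMENTS / SHARD_COUNT
    return gb_per_shard, docs_per_shard


def distributed_query_url(base_url, shard_addresses, query, rows=10):
    """Build a Solr select URL that fans the query out to every listed shard."""
    params = {
        "q": query,
        "rows": rows,
        "wt": "json",
        "shards": ",".join(shard_addresses),
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical shard addresses.
shards = [f"solr{n:02d}.example.org:8983/solr/webarchive"
          for n in range(1, SHARD_COUNT + 1)]

gb, docs = per_shard_averages()
print(round(gb), round(docs / 1e6))  # 146 54

url = distributed_query_url(
    "http://solr01.example.org:8983/solr/webarchive", shards, "text:library"
)
```

On average each shard holds about 146GB and 54 million documents, so a single shard is larger than the whole ~100GB Primo index described earlier.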
Service challenges with Deep Search 
•Additional point of failure within Primo 
–Newly introduced Troubleshooting procedure 
–Service Notices for planned Downtime/Maintenance 
–Changes to the Solr Schema that could break our search 
•Primo Upgrades & Hotfixes can affect the functioning of the Deep Search 
–Re-builds of the Client in case the Deep Search libraries (jars) change 
Development challenges with Deep Search 
•Significant development work to implement Primo/Solr integration 
•Accommodate new Primo features (versioning control) 
–E.g. multi-select faceting (exclusion and inclusion) 
•Need to ensure consistency across local and non-local collections 
•Seemingly NO difference from a UI/UX point of view 
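One part of that integration work is turning facet selections made in the Primo interface into something Solr understands. The sketch below shows one possible mapping onto Solr filter queries (`fq`). It is not the British Library's client code: the function name, the field names `domain` and `content_type`, and the choice to OR together several included values of the same facet are all assumptions.

```python
def facet_filters(include, exclude):
    """Translate include/exclude facet selections into Solr fq clauses.

    include and exclude map a field name to a list of selected values.
    """
    def quote(value):
        # Escape backslashes and double quotes, then wrap as a phrase.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    filters = []
    for field, values in sorted(include.items()):
        # Assumption: several included values of one facet are ORed together.
        joined = " OR ".join(quote(v) for v in values)
        filters.append(f"{field}:({joined})")
    for field, values in sorted(exclude.items()):
        # Each excluded value becomes its own negative filter.
        for value in values:
            filters.append(f"-{field}:{quote(value)}")
    return filters


# Hypothetical selection: two included domains, one excluded content type.
filters = facet_filters(
    include={"domain": ["bbc.co.uk", "bl.uk"]},
    exclude={"content_type": ["image"]},
)
for clause in filters:
    print(clause)
```

Each returned string would be sent as a separate `fq` parameter, so the same selections narrow the external results in the way users expect from local collections.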
But there are compensations… 
•Ideal for large indexes and frequent updates 
•Independent indexing processes 
•Existing scheduled processes are unaffected 
•Leverages existing built-in Primo features 
•Existing and extendable Solr APIs / active community 
Harvest vs Deep Search: how we decided 
[Diagram: factors weighed between Deep Search and Harvest]
•Additional infrastructure costs 
•Dev time / specialist skills 
•Index size / no. of records 
•Frequency & volume of updates 
•Impact on current processes 
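As an illustration of how some of these factors could be combined, here is a deliberately simple rule of thumb applied to the two case studies. It is not the authors' actual decision procedure: the class, the function, the `max_growth_ratio` threshold and the true/false values given to each source are all hypothetical, and cost and skills are left out entirely.

```python
from dataclasses import dataclass

EXISTING_INDEX_RECORDS = 67_000_000  # size of the main Primo index, from the slides


@dataclass
class NewSource:
    name: str
    records: int
    needs_fast_updates: bool  # must changes appear sooner than the overnight cycle allows?
    has_external_index: bool  # is there already a separately run index to search?


def recommend(source: NewSource, max_growth_ratio: float = 0.5) -> str:
    """Hypothetical rule of thumb; max_growth_ratio is an invented threshold."""
    too_large = source.records > EXISTING_INDEX_RECORDS * max_growth_ratio
    if source.has_external_index and (too_large or source.needs_fast_updates):
        return "deep search"
    return "harvest"


# Record counts from the slides; the boolean values are assumptions.
cms = NewSource("Content Management System", 50_000, False, True)
web_archive = NewSource("Web Archive", 1_300_000_000, True, True)

print(recommend(cms))          # harvest
print(recommend(web_archive))  # deep search
```

With these inputs the sketch reproduces the two outcomes described in the case studies: the small, daily-updated website content is harvested, while the 1.3 billion page Web Archive is reached by deep search.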
Thank you

