SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Introduction to Apache Solr
Andrew Jackson
UK Web Archive Technical Lead
www.bl.uk 2
Web Archive Overall Architecture
www.bl.uk 3
Understanding Your Use Case(s)
• Full text search, right?
– Yes, but there are many variations and choices to make.
• Work with users to understand their information needs:
– Are they looking for…
• Particular (archived) web resources?
• Resources on a particular issue or subject?
• Evidence of trends over time?
– What aspects of the content do they consider important?
– What kind of outputs do they want?
www.bl.uk 4
Working With Historians…
• JISC AADDA Project:
– Initial index and UI of the 1996-2010 data
– Great learning experience and feedback
– http://domaindarkarchive.blogspot.co.uk/
• AHRC ‘Big Data’ Project:
– Second iteration of index and UI
– Bursary holders reports coming soon
– http://buddah.projects.history.ac.uk/
• Interested in trends and reflections of society
– Who links to who/what, over time?
www.bl.uk 5
Apache Solr & Lucene
• Apache Lucene:
– A Java library for full text indexes
• Apache Solr:
– A web service and API that exposes Lucene functionality in a
as a document database
– Supports SolrCloud mode for distributed searches
• See also:
– Elasticsearch (also built around Lucene)
– We ‘chose’ Solr before Elasticsearch existed
– http://solr-vs-elasticsearch.com/
www.bl.uk 6
Example: Indexing Quotes
• Quotes to be indexed:
– “To do is to be.” - Jean-Paul Sartre
– “To be is to do.” - Socrates
– “Do be do be do.” - Frank Sinatra
• Goals:
– Index the quotation for full-text search.
• e.g. Show me all quotes that contain “to be”.
– Index the author for faceted search.
• e.g. Show me all quotes by “Frank Sinatra”.
www.bl.uk 7
Lucene’s Inverted Indexes
www.bl.uk 8
Solr as a Document Database
• Solr Indexes/Stores & Retrieves:
– Documents
composed of:
• Multiple Fields
each of which has a defined:
– Field Type
such as ‘text’, ‘string’, ‘int’, etc.
• The queries you can support depend on on many
parameters, but the fields and their types are the most
critical factors.
– See Overview of Documents, Fields, and Schema Design
www.bl.uk 9
The Quotes As Solr Documents
• Our Documents contain three fields:
– ‘id’ field of type ‘string’
– ‘text’ field of type ‘text_general’
– ‘author’ field, of type ‘string’
• Example Documents:
– id: “1”, text: “To do is to be.”, author: “Jean-Paul Sartre”
– id: “2”, text: “To be is to do.”, author: “Socrates”
– id: “3”, text: “Do be do be do.”, author: “Frank Sinatra”
www.bl.uk 10
Solr Update Flow
www.bl.uk 11
Analyzing The Text Field
• Analyzing the text on document 1:
– Input: “To do is to be.”, type = ‘text_general’
– Standard Tokeniser:
• ‘To’ ‘be’ ‘is’ ‘to’ ‘do’
– Lower Case Filter:
• ‘to’ ‘be’ ‘is’ ‘to’ ‘do’
• Adding the tokens to the index:
– ‘be’ => id:1
– ‘do’ => id:1
– …
www.bl.uk 12
Analyzing The Author Field
• Analyzing the author on document 1:
– Input: “Jean-Paul Sartre”, type = ‘string’
– Strings are stored as is.
• Adding the tokens to the index:
– ‘Jean-Paul Sartre’ => id:1
www.bl.uk 13
Solr Query Flow
www.bl.uk 14
Query for text:“To be”
• Uses the same analyser
as the indexer:
– “To be?”
– ST: “To” “be”
– LCF: “to” “be”
• Returns
documents:
– 1
– 2
www.bl.uk 15
Solr’s Built-in UI
www.bl.uk 16
Solr Overall Flow
www.bl.uk 17
Choice: Ignore ‘stop words’?
• Removes common words, unrelated to subject/topic
– Input: “To do is to be”
– Standard Tokeniser:
• ‘To’ ‘be’ ‘is’ ‘to’ ‘do’
– Stop Words Filter (stopwords_en.txt):
• ‘do’
– Lower Case Filter:
• ‘do’
• Cannot support phrase search
– e.g. searching for “to be”
www.bl.uk 18
Choice: Stemming?
• Attempts to group concepts together:
– "fishing", "fished”, "fisher" => "fish"
– "argue", "argued", "argues", "arguing”, "argus” => "argu"
• Sometimes confused:
– "axes” => "axe”, or ”axis”?
• Better at grouping related items together
• Makes precise phrase searching difficult
www.bl.uk 19
So Many Choices…
• Lots of text indexing options to tune:
– Punctuation and tokenization:
• is www.google.com one or three tokens?
– Stop word filter (“the” => “”)
– Lower case filter (“This” => “this”)
– Stemming (choice of algorithms too)
– Keywords (excepted from stemming)
– Synonyms (“TV” => “Television”)
– Possessive Filter (“Blair’s” => “Blair”)
– …and many more Tokenizers and Filters.
www.bl.uk 20
Even More Choices: Query Features
• As well as full-text search variations, we have
– Query parsers and features:
• Proximity, wildcards, term frequencies, relevance…
– Faceted search
– Numeric or Date values and range queries
– Geographic data and spatial search
– Snippets/fragments and highlighting
– Spell checking i.e. ‘Did you mean …?’
– MoreLikeThis
– Clustering
www.bl.uk 21
How to get started?
• Experimenting with the UKWA stack:
– Indexing:
• webarchive-discovery
– User Interfaces:
• Drupal Sarnia
• Shine (Play Framework, by UKWA)
• See https://github.com/ukwa/webarchive-
discovery/wiki/Front-ends
www.bl.uk 22
The webarchive-discovery system
• The webarchive-discovery codebase is an indexing stack
that reflects our (UKWA) use cases
– Contains our choices, reflects our progress so far
– Turns ARC or WARC records into Solr Documents
– Highly robust against (W)ARC data quality problems
• Adds custom fields for web archiving
– Text extracted using Apache Tika
– Various other analysis features
• Workshop sessions will use our setup
– but this is only a starting point…
www.bl.uk 23
Features: Basic Metadata Fields
• From the file system:
– The source (W)ARC filename and offset
• From the WARC record:
– URL, host, domain, public suffix
– Crawl date(s)
• From the HTTP headers:
– Content length
– Content type (as served)
– Server software IDs
www.bl.uk 24
Features: Payload Analysis
• Binary hash, embedded metadata
• Format and preservation risk analysis:
– Apache Tika & DROID format and encoding ID
– Notes parse errors to spot access problems
– Apache Preflight PDF risk analysis
– XML root namespace
– Format signature generation tricks
• HTML links, elements used, licence/rights URL
• Image properties, dominant colours, face detection
www.bl.uk 25
Features: Text Analysis
• Text extraction from binary formats
• ‘Fuzzy’ hash (ssdeep) of text
– for similarity analysis
• Natural language detection
• UK postcode extraction and geo-indexing
• Experimental language analysis:
– Simplistic sentiment analysis
– Stanford NLP named entity extraction
– Initial GATE NLP analyser
www.bl.uk 26
Command-line Indexing Architecture
www.bl.uk 27
Hadoop Indexing Architecture
www.bl.uk 28
Scaling Solr
• We are operating outside Solr’s sweet spot:
– General recommendation is RAM = Index Size
– We have a 15TB index. That’s a lot of RAM.
• e.g. from this email
– “100 million documents [and 16-32GB] per node”
– “it's quite the fool's errand for average developers to try to
replicate the "heroic efforts" of the few.”
• So how to scale up?
www.bl.uk 29
Basic Index Performance Scaling
• One Query:
– Single-threaded binary search
– Seek-and-read speed is critical, not CPU
• Add RAID/SAN?
– More IOPS can support more concurrent queries
– BUT each query is no faster
• Want faster queries?
– Use SSD, and/or
– More RAM to cache more disk, and/or
– Split the data into more shards (on independent media)
www.bl.uk 30
Sharding & SolrCloud
• For > ~100 million documents, use shards
– More, smaller independent shards == faster search
• Shard generation:
– SolrCloud ‘Live’ shards
• We use Solr’s standard sharding
• Randomly distributes records
• Supports updates to records
– Manual sharding
• e.g. ‘static’ shards generated from files
• As used by the Danish web archive (see later today)
www.bl.uk 31
Next Steps
• Prototype, Prototype, Prototype
– Expect to re-index
– Expect to iterate your front and back end systems
– Seek real user feedback
• Benchmark, Benchmark, Benchmark
– More on scaling issues and benchmarking this afternoon
• Work Together
– Share use cases, indexing tactics
– Share system specs, benchmarks
– Share code where appropriate

Weitere ähnliche Inhalte

Was ist angesagt?

20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorialChris Huang
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solrsagar chaturvedi
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHPPaul Borgermans
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From SolrRamzi Alqrainy
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache SolrEdureka!
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache SolrBiogeeks
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy SokolenkoProvectus
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development TutorialErik Hatcher
 
State-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache SolrState-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache Solrguest432cd6
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and SparkLucidworks
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5israelekpo
 

Was ist angesagt? (20)

20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorial
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHP
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From Solr
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache Solr
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearch
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache Solr
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
Apache Solr
Apache SolrApache Solr
Apache Solr
 
State-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache SolrState-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache Solr
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
 
Solr Architecture
Solr ArchitectureSolr Architecture
Solr Architecture
 

Andere mochten auch

Introduction to Apache Solr.
Introduction to Apache Solr.Introduction to Apache Solr.
Introduction to Apache Solr.ashish0x90
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrChristos Manios
 
An Introduction to Solr
An Introduction to SolrAn Introduction to Solr
An Introduction to Solrtomhill
 
Introduction to SolrCloud
Introduction to SolrCloudIntroduction to SolrCloud
Introduction to SolrCloudVarun Thacker
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineTrey Grainger
 
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Alexandre Rafalovitch
 
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014francelabs
 
Apprendre Solr en deux heures
Apprendre Solr en deux heuresApprendre Solr en deux heures
Apprendre Solr en deux heuresSaïd Radhouani
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsJulien Nioche
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platformmteutelink
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...hannonhill
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01David Smiley
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache TikaPaolo Mottadelli
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutchsebastian_nagel
 

Andere mochten auch (20)

Introduction to Apache Solr.
Introduction to Apache Solr.Introduction to Apache Solr.
Introduction to Apache Solr.
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
An Introduction to Solr
An Introduction to SolrAn Introduction to Solr
An Introduction to Solr
 
Solr Presentation
Solr PresentationSolr Presentation
Solr Presentation
 
Introduction to SolrCloud
Introduction to SolrCloudIntroduction to SolrCloud
Introduction to SolrCloud
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engine
 
Solr formation Sparna
Solr formation SparnaSolr formation Sparna
Solr formation Sparna
 
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
 
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
Solr, c'est simple et Big Data ready - prez au Lyon jug Fév 2014
 
Apprendre Solr en deux heures
Apprendre Solr en deux heuresApprendre Solr en deux heures
Apprendre Solr en deux heures
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
 

Ähnlich wie Introduction to Apache Solr

Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Roxanne Missingham
 
Why do you consider to adopt Koha Open Source Integrated Library System for y...
Why do you consider to adopt Koha Open Source Integrated Library System for y...Why do you consider to adopt Koha Open Source Integrated Library System for y...
Why do you consider to adopt Koha Open Source Integrated Library System for y...Md. Zahid Hossain Shoeb
 
Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Roy Russo
 
Everything you always wanted to know about WorldCat (but were afraid to ask) ...
Everything you always wanted to know about WorldCat (but were afraid to ask) ...Everything you always wanted to know about WorldCat (but were afraid to ask) ...
Everything you always wanted to know about WorldCat (but were afraid to ask) ...CILIP MDG
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCLucidworks (Archived)
 
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Lucidworks
 
Computer Science Library Training
Computer Science Library TrainingComputer Science Library Training
Computer Science Library Trainingpvhead123
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBSean Laurent
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16MLconf
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksLucidworks
 
SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.Julian Hyde
 
ELK stack introduction
ELK stack introduction ELK stack introduction
ELK stack introduction abenyeung1
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Ontico
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrJake Mannix
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
SPARQL in the Semantic Web
SPARQL in the Semantic WebSPARQL in the Semantic Web
SPARQL in the Semantic WebJan Beeck
 
What do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St AndrewsWhat do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St AndrewsCIGScotland
 

Ähnlich wie Introduction to Apache Solr (20)

IIPC GA 2014 Solr
IIPC GA 2014 SolrIIPC GA 2014 Solr
IIPC GA 2014 Solr
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
Why do you consider to adopt Koha Open Source Integrated Library System for y...
Why do you consider to adopt Koha Open Source Integrated Library System for y...Why do you consider to adopt Koha Open Source Integrated Library System for y...
Why do you consider to adopt Koha Open Source Integrated Library System for y...
 
Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015
 
Everything you always wanted to know about WorldCat (but were afraid to ask) ...
Everything you always wanted to know about WorldCat (but were afraid to ask) ...Everything you always wanted to know about WorldCat (but were afraid to ask) ...
Everything you always wanted to know about WorldCat (but were afraid to ask) ...
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
Building a Real-Time News Search Engine: Presented by Ramkumar Aiyengar, Bloo...
 
Computer Science Library Training
Computer Science Library TrainingComputer Science Library Training
Computer Science Library Training
 
Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.
 
ELK stack introduction
ELK stack introduction ELK stack introduction
ELK stack introduction
 
Google Dorks
Google DorksGoogle Dorks
Google Dorks
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
SPARQL in the Semantic Web
SPARQL in the Semantic WebSPARQL in the Semantic Web
SPARQL in the Semantic Web
 
What do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St AndrewsWhat do you want to discover today? / Janet Aucock, University of St Andrews
What do you want to discover today? / Janet Aucock, University of St Andrews
 

Mehr von Andy Jackson

The 'Digital Object Types' Issue
The 'Digital Object Types' IssueThe 'Digital Object Types' Issue
The 'Digital Object Types' IssueAndy Jackson
 
Ten years of the UK web archive: what have we saved?
Ten years of the UK web archive: what have we saved?Ten years of the UK web archive: what have we saved?
Ten years of the UK web archive: what have we saved?Andy Jackson
 
Seeing In The Dark: Discovery and data-mining of restricted web archives
Seeing In The Dark: Discovery and data-mining of restricted web archivesSeeing In The Dark: Discovery and data-mining of restricted web archives
Seeing In The Dark: Discovery and data-mining of restricted web archivesAndy Jackson
 
Digging into the Web Archive at the British Library 2014-11-27
Digging into the Web Archive at the British Library 2014-11-27Digging into the Web Archive at the British Library 2014-11-27
Digging into the Web Archive at the British Library 2014-11-27Andy Jackson
 
Unified characterisation, please
Unified characterisation, pleaseUnified characterisation, please
Unified characterisation, pleaseAndy Jackson
 
Formats Over Time: Exploring UK Web History
Formats Over Time: Exploring UK Web HistoryFormats Over Time: Exploring UK Web History
Formats Over Time: Exploring UK Web HistoryAndy Jackson
 

Mehr von Andy Jackson (6)

The 'Digital Object Types' Issue
The 'Digital Object Types' IssueThe 'Digital Object Types' Issue
The 'Digital Object Types' Issue
 
Ten years of the UK web archive: what have we saved?
Ten years of the UK web archive: what have we saved?Ten years of the UK web archive: what have we saved?
Ten years of the UK web archive: what have we saved?
 
Seeing In The Dark: Discovery and data-mining of restricted web archives
Seeing In The Dark: Discovery and data-mining of restricted web archivesSeeing In The Dark: Discovery and data-mining of restricted web archives
Seeing In The Dark: Discovery and data-mining of restricted web archives
 
Digging into the Web Archive at the British Library 2014-11-27
Digging into the Web Archive at the British Library 2014-11-27Digging into the Web Archive at the British Library 2014-11-27
Digging into the Web Archive at the British Library 2014-11-27
 
Unified characterisation, please
Unified characterisation, pleaseUnified characterisation, please
Unified characterisation, please
 
Formats Over Time: Exploring UK Web History
Formats Over Time: Exploring UK Web HistoryFormats Over Time: Exploring UK Web History
Formats Over Time: Exploring UK Web History
 

Kürzlich hochgeladen

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Kürzlich hochgeladen (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Introduction to Apache Solr

  • 1. Introduction to Apache Solr Andrew Jackson UK Web Archive Technical Lead
  • 2. www.bl.uk 2 Web Archive Overall Architecture
  • 3. www.bl.uk 3 Understanding Your Use Case(s) • Full text search, right? – Yes, but there are many variations and choices to make. • Work with users to understand their information needs: – Are they looking for… • Particular (archived) web resources? • Resources on a particular issue or subject? • Evidence of trends over time? – What aspects of the content do they consider important? – What kind of outputs do they want?
  • 4. www.bl.uk 4 Working With Historians… • JISC AADDA Project: – Initial index and UI of the 1996-2010 data – Great learning experience and feedback – http://domaindarkarchive.blogspot.co.uk/ • AHRC ‘Big Data’ Project: – Second iteration of index and UI – Bursary holders reports coming soon – http://buddah.projects.history.ac.uk/ • Interested in trends and reflections of society – Who links to who/what, over time?
  • 5. www.bl.uk 5 Apache Solr & Lucene • Apache Lucene: – A Java library for full text indexes • Apache Solr: – A web service and API that exposes Lucene functionality in a as a document database – Supports SolrCloud mode for distributed searches • See also: – Elasticsearch (also built around Lucene) – We ‘chose’ Solr before Elasticsearch existed – http://solr-vs-elasticsearch.com/
  • 6. www.bl.uk 6 Example: Indexing Quotes • Quotes to be indexed: – “To do is to be.” - Jean-Paul Sartre – “To be is to do.” - Socrates – “Do be do be do.” - Frank Sinatra • Goals: – Index the quotation for full-text search. • e.g. Show me all quotes that contain “to be”. – Index the author for faceted search. • e.g. Show me all quotes by “Frank Sinatra”.
  • 8. www.bl.uk 8 Solr as a Document Database • Solr Indexes/Stores & Retrieves: – Documents composed of: • Multiple Fields each of which has a defined: – Field Type such as ‘text’, ‘string’, ‘int’, etc. • The queries you can support depend on on many parameters, but the fields and their types are the most critical factors. – See Overview of Documents, Fields, and Schema Design
  • 9. www.bl.uk 9 The Quotes As Solr Documents • Our Documents contain three fields: – ‘id’ field of type ‘string’ – ‘text’ field of type ‘text_general’ – ‘author’ field, of type ‘string’ • Example Documents: – id: “1”, text: “To do is to be.”, author: “Jean-Paul Sartre” – id: “2”, text: “To be is to do.”, author: “Socrates” – id: “3”, text: “Do be do be do.”, author: “Frank Sinatra”
  • 11. www.bl.uk 11 Analyzing The Text Field • Analyzing the text on document 1: – Input: “To do is to be.”, type = ‘text_general’ – Standard Tokeniser: • ‘To’ ‘be’ ‘is’ ‘to’ ‘do’ – Lower Case Filter: • ‘to’ ‘be’ ‘is’ ‘to’ ‘do’ • Adding the tokens to the index: – ‘be’ => id:1 – ‘do’ => id:1 – …
  • 12. www.bl.uk 12 Analyzing The Author Field • Analyzing the author on document 1: – Input: “Jean-Paul Sartre”, type = ‘string’ – Strings are stored as is. • Adding the tokens to the index: – ‘Jean-Paul Sartre’ => id:1
  • 14. www.bl.uk 14 Query for text:“To be” • Uses the same analyser as the indexer: – “To be?” – ST: “To” “be” – LCF: “to” “be” • Returns documents: – 1 – 2
  • 17. www.bl.uk 17 Choice: Ignore ‘stop words’? • Removes common words, unrelated to subject/topic – Input: “To do is to be” – Standard Tokeniser: • ‘To’ ‘be’ ‘is’ ‘to’ ‘do’ – Stop Words Filter (stopwords_en.txt): • ‘do’ – Lower Case Filter: • ‘do’ • Cannot support phrase search – e.g. searching for “to be”
  • 18. www.bl.uk 18 Choice: Stemming? • Attempts to group concepts together: – "fishing", "fished”, "fisher" => "fish" – "argue", "argued", "argues", "arguing”, "argus” => "argu" • Sometimes confused: – "axes” => "axe”, or ”axis”? • Better at grouping related items together • Makes precise phrase searching difficult
  • 19. www.bl.uk 19 So Many Choices… • Lots of text indexing options to tune: – Punctuation and tokenization: • is www.google.com one or three tokens? – Stop word filter (“the” => “”) – Lower case filter (“This” => “this”) – Stemming (choice of algorithms too) – Keywords (excepted from stemming) – Synonyms (“TV” => “Television”) – Possessive Filter (“Blair’s” => “Blair”) – …and many more Tokenizers and Filters.
  • 20. www.bl.uk 20 Even More Choices: Query Features • As well as full-text search variations, we have – Query parsers and features: • Proximity, wildcards, term frequencies, relevance… – Faceted search – Numeric or Date values and range queries – Geographic data and spatial search – Snippets/fragments and highlighting – Spell checking i.e. ‘Did you mean …?’ – MoreLikeThis – Clustering
  • 21. www.bl.uk 21 How to get started? • Experimenting with the UKWA stack: – Indexing: • webarchive-discovery – User Interfaces: • Drupal Sarnia • Shine (Play Framework, by UKWA) • See https://github.com/ukwa/webarchive- discovery/wiki/Front-ends
  • 22. www.bl.uk 22 The webarchive-discovery system • The webarchive-discovery codebase is an indexing stack that reflects our (UKWA) use cases – Contains our choices, reflects our progress so far – Turns ARC or WARC records into Solr Documents – Highly robust against (W)ARC data quality problems • Adds custom fields for web archiving – Text extracted using Apache Tika – Various other analysis features • Workshop sessions will use our setup – but this is only a starting point…
  • 23. www.bl.uk 23 Features: Basic Metadata Fields • From the file system: – The source (W)ARC filename and offset • From the WARC record: – URL, host, domain, public suffix – Crawl date(s) • From the HTTP headers: – Content length – Content type (as served) – Server software IDs
  • 24. www.bl.uk 24 Features: Payload Analysis • Binary hash, embedded metadata • Format and preservation risk analysis: – Apache Tika & DROID format and encoding ID – Notes parse errors to spot access problems – Apache Preflight PDF risk analysis – XML root namespace – Format signature generation tricks • HTML links, elements used, licence/rights URL • Image properties, dominant colours, face detection
  • 25. www.bl.uk 25 Features: Text Analysis • Text extraction from binary formats • ‘Fuzzy’ hash (ssdeep) of text – for similarity analysis • Natural language detection • UK postcode extraction and geo-indexing • Experimental language analysis: – Simplistic sentiment analysis – Stanford NLP named entity extraction – Initial GATE NLP analyser
  • 28. www.bl.uk 28 Scaling Solr • We are operating outside Solr’s sweet spot: – General recommendation is RAM = Index Size – We have a 15TB index. That’s a lot of RAM. • e.g. from this email – “100 million documents [and 16-32GB] per node” – “it's quite the fool's errand for average developers to try to replicate the "heroic efforts" of the few.” • So how to scale up?
  • 29. www.bl.uk 29 Basic Index Performance Scaling • One Query: – Single-threaded binary search – Seek-and-read speed is critical, not CPU • Add RAID/SAN? – More IOPS can support more concurrent queries – BUT each query is no faster • Want faster queries? – Use SSD, and/or – More RAM to cache more disk, and/or – Split the data into more shards (on independent media)
  • 30. www.bl.uk 30 Sharding & SolrCloud • For > ~100 million documents, use shards – More, smaller independent shards == faster search • Shard generation: – SolrCloud ‘Live’ shards • We use Solr’s standard sharding • Randomly distributes records • Supports updates to records – Manual sharding • e.g. ‘static’ shards generated from files • As used by the Danish web archive (see later today)
  • 31. www.bl.uk 31 Next Steps • Prototype, Prototype, Prototype – Expect to re-index – Expect to iterate your front and back end systems – Seek real user feedback • Benchmark, Benchmark, Benchmark – More on scaling issues and benchmarking this afternoon • Work Together – Share use cases, indexing tactics – Share system specs, benchmarks – Share code where appropriate