Grow-your-own search engine
Solr is a search server built on Lucene that provides indexing, relevance ranking, and other search features through REST-like web services. It can be configured through XML without coding and is used by many large organisations. Solr can index various data types, including documents, databases, and crawled content. Queries are parsed and run against the index to return results ranked by factors such as term frequency and inverse document frequency. A case study shows how Solr improved search over the CATH protein structure database by indexing its data.
2. Lucene: a search engine toolkit
Indexing
Relevance ranking
Spellchecking
Hit highlighting
Text processing tools
etc. etc.
Originally in Java
Ported to C#.NET, Python, C (Lucy)
Perl & Ruby bindings for Lucy
Not an off-the-shelf application — requires coding
3. Solr is a search server built on Lucene
Web admin interface
Configured via XML — no coding required
Controlled via HTTP — REST-like web services
Database-like schema — typed fields, unique keys
Extended query syntax
Caching, replication, clustering, delta indexing, sharding
Used by: AOL, news.com, Digg, Apple, Disney, MTV,
CiteSeerX, PubGet, NASA, the White House…
4. Indexing
'documents' → indexing → index
Document sources: XML, CSV, HTTP requests, database queries
Indexing steps: tokenization, term extraction, frequency calculations, etc.
Web crawling & PDF/Word/etc. extraction via other tools
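One way documents reach Solr over HTTP is as an XML `<add>` payload. A minimal sketch of building such a payload with Python's standard library (the field names here are hypothetical):

```python
import xml.etree.ElementTree as ET

def solr_add_xml(docs):
    """Build a Solr <add> payload: one <doc> per record,
    one <field name="..."> element per key/value pair."""
    add = ET.Element("add")
    for d in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in d.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

payload = solr_add_xml([{"id": "1abc00", "name": "kinase"}])
```

The resulting string would then be POSTed to Solr's update handler (e.g. `/solr/update`), followed by a commit, to make the documents searchable.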
5. Querying
user's query → parsing → index lookup → ranked hits
Ranked hits contain database/document IDs and/or content stored in the index
6. Query syntax example:
name:(kinase OR phosphotransferase)
AND species:Saccharomyces
AND description:phosphoryl*
AND createdate:[NOW-1YEAR/DAY TO NOW]
"name contains kinase or phosphotransferase,
and species contains Saccharomyces,
and description contains a word starting with phosphoryl,
and createdate is any time between one year ago
(rounded to the nearest day) and now"
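A query like this is typically sent as the `q` parameter of an HTTP GET against Solr's select handler. A sketch using only Python's standard library to build the request URL (host, path, and extra parameters are illustrative):

```python
from urllib.parse import urlencode

# The query from the slide, as one string (Solr date maths: NOW-1YEAR/DAY)
query = ('name:(kinase OR phosphotransferase)'
         ' AND species:Saccharomyces'
         ' AND description:phosphoryl*'
         ' AND createdate:[NOW-1YEAR/DAY TO NOW]')

# URL-encode it alongside some common Solr parameters
params = urlencode({'q': query, 'rows': 10, 'wt': 'json'})
url = 'http://localhost:8983/solr/select?' + params  # hypothetical host/path
```

Fetching that URL would return the top 10 ranked hits as JSON.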
7. Ranking based on tf.idf model…
tf = term frequency
idf = inverse document frequency
Query terms appearing frequently in this document are up-weighted
… but …
query terms appearing in many documents are down-weighted
Highest ranked docs contain multiple instances of rare query terms
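A toy version of this scoring, assuming the simplest tf.idf form (Lucene's real Similarity adds square-root damping of tf, idf smoothing, and length normalisation):

```python
import math

def tfidf_scores(query_terms, docs):
    """Score each doc as the sum over query terms of tf * idf,
    where tf = term count in the doc and idf = log(N / doc frequency)."""
    N = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            tf = doc.count(term)                      # frequent here: up-weight
            df = sum(1 for d in docs if term in d)    # frequent everywhere: down-weight
            if tf and df:
                score += tf * math.log(N / df)
        scores.append(score)
    return scores

docs = [['kinase', 'kinase', 'yeast'],   # rare query term, twice -> top hit
        ['kinase', 'protein'],
        ['protein', 'yeast']]
scores = tfidf_scores(['kinase'], docs)
```

As the slide says: the highest-ranked document is the one with multiple instances of the rarer query term.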
8. Case study: indexing the CATH protein
domain database with Solr
http://cathdb.info/
12. CLASS e.g. Mixed Alpha/Beta depth = 1
ARCHITECTURE e.g. Alpha/Beta Barrel depth = 2
TOPOLOGY e.g. TIM Barrel depth = 3
(fold family)
HOMOLOGOUS e.g. Aldolase class I depth = 4
SUPERFAMILY
13. Previous CATH search box used plain SQL –
lots of LIKE/equals matches against relevant fields
• Slow
• Poor coverage (exact string match needed)
• No relevance ordering
• Slow
• Hard to debug unexpected results
• Can't be exposed as a service
• Did we mention slow...?
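The coverage problem is easy to reproduce with plain SQL LIKE matching. In this toy table (schema and rows invented for illustration), a search for "kinase" misses the row described with a synonym entirely:

```python
import sqlite3

# Toy reproduction of the old approach: exact-substring LIKE matching
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE domain (id TEXT, description TEXT)')
conn.executemany('INSERT INTO domain VALUES (?, ?)', [
    ('1abc00', 'protein kinase domain'),
    ('2xyz00', 'phosphotransferase activity'),   # synonym: never matched below
])
hits = conn.execute(
    "SELECT id FROM domain WHERE description LIKE '%kinase%'"
).fetchall()
# Only the literal substring matches; there is no relevance ordering,
# and each such query is a full-table scan.
```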
14. Entities to index:
• CATH nodes, esp. superfamilies
• Proteins
• Chains
• Domains
Data sources: CATHDB (PostgreSQL), UniProt, PDB, GO, EC, KEGG
Flow: fetch & index into Solr; user queries come in via the CATH website
Fetching/indexing data needs no coding – just XML and a bit of SQL
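One no-coding route for this kind of fetch-and-index step is Solr's DataImportHandler, configured entirely in XML plus SQL. A sketch of a `data-config.xml` (database URL, table, and column names are hypothetical):

```xml
<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/cath"
              user="reader"/>
  <document>
    <entity name="domain"
            query="SELECT domain_id, description FROM domain">
      <field column="domain_id" name="id"/>
      <field column="description" name="description"/>
    </entity>
  </document>
</dataConfig>
```

The SQL query defines what to fetch; the `<field>` mappings say which index field each column lands in.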
16. Fun with indexes
• “Give me the 10 most similar domains to this one”
Based on term similarity.
Useful for quickly finding functional analogues.
• “Give me the top 10 keywords for this superfamily”
Can be used to generate tag clouds & summaries.
Useful for data mining too.
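Solr exposes "most similar documents" through its MoreLikeThis handler, which compares documents by their terms. The underlying idea can be sketched as cosine similarity over term-count vectors (toy data, invented for illustration):

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Term vectors for three hypothetical domains
docs = {
    'dom1': Counter(['kinase', 'atp', 'binding']),
    'dom2': Counter(['kinase', 'atp', 'transferase']),
    'dom3': Counter(['barrel', 'lipid']),
}

# Rank every other domain by term similarity to dom1
target = docs['dom1']
ranked = sorted((d for d in docs if d != 'dom1'),
                key=lambda d: cosine(target, docs[d]),
                reverse=True)
```

The same term vectors, aggregated per superfamily, are what make the "top 10 keywords" and tag-cloud ideas work.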
17. Some facts and figures
• Indexing takes ~ 6 hours (dedicated box)
• Index is currently 2.0GB
• 357,850 entities in index (some are 1-3MB)
• Typical single query time in Solr: 1ms – 200ms
• Total web query time: ~ 3s
Any questions?