DEVELOP OPEN SOURCE SEARCH ENGINE
26th Feb 2012   Ritesh Ambastha – CEO, iWillStudy.com
Open Source Search Engines

 Sphinx   Lucene    DataparkSearch   Zettair
 YaCy     Xapian    SWISH-E          Seeks
 Recoll   OpenFTS   Nutch            Namazu
Platform Ideas!

Credits: http://zooie.wordpress.com
Comparison

Credits: http://zooie.wordpress.com
Comparison

Credits: http://zooie.wordpress.com
We are going to talk about Sphinx & Apache Solr
Sphinx
• Sphinx is an open source full-text search server.
• It is written in C++ and works on Linux (RedHat, Ubuntu, etc.), Windows, MacOS, Solaris, FreeBSD, and a few other systems.
• Sphinx lets you batch index and search data stored in an SQL database, NoSQL storage, or plain files, quickly and easily.
Sphinx

• Text processing features
• Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler.
• Sphinx clusters scale up to billions of documents and tens of millions of search queries per day, powering top websites such as Craigslist, DailyMotion, NetLog, etc.
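As a rough illustration of the SphinxQL point: SphinxQL is an SQL-like dialect spoken over the MySQL wire protocol, so a full-text query is a single SELECT with a MATCH() clause. The sketch below only builds the statement and its parameters. The index name `articles` and the helper `build_sphinxql_query` are hypothetical, and actually sending the query would need a MySQL client library connected to the Sphinx daemon's SphinxQL listener (port 9306 is the conventional choice, assumed here rather than taken from the slides).

```python
def build_sphinxql_query(index, text, limit=10):
    """Return (statement, params) for a SphinxQL full-text query.

    `index` is a hypothetical index name; it is validated because
    identifiers cannot be passed as bound parameters.
    """
    if not index.isidentifier():
        raise ValueError("index name must be a plain identifier")
    if limit < 1:
        raise ValueError("limit must be positive")
    # The search text is passed as a bound parameter so the client
    # library handles quoting; LIMIT is inlined as a validated integer.
    statement = f"SELECT id FROM {index} WHERE MATCH(%s) LIMIT {int(limit)}"
    return statement, (text,)


statement, params = build_sphinxql_query("articles", "open source search")
print(statement)
print(params)
```

The `%s` placeholder follows the style used by common Python MySQL client libraries; the returned pair would be handed to a cursor's `execute` call.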
Performance and scalability
• Indexing performance: Sphinx indexes up to 10-15 MB of text per second per single CPU core.
• Searching performance: searching through the 1,000,000-document, 1.2 GB text collection that the Sphinx developers use for everyday development and testing runs at 500+ queries/sec on a 2-core desktop machine with 2 GB of RAM.
• Scalability: the biggest known Sphinx cluster indexes almost 5 billion documents, resulting in over 6 TB of data.
• The busiest known one is, unsurprisingly, Craigslist, a top-10 website in the US that serves 50+ million search queries per day.
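A quick back-of-the-envelope check of these figures, as a sketch only: the helper names are made up for illustration, 1 GB is taken as 1000 MB, and the Craigslist figure is read as 50 million queries per day (an assumption based on the "queries per day" wording used on the earlier Sphinx slide).

```python
def indexing_seconds(corpus_mb, mb_per_sec):
    """Time to index a corpus on one core at a given throughput."""
    return corpus_mb / mb_per_sec


def average_qps(queries_per_day):
    """Average queries per second implied by a daily total."""
    return queries_per_day / (24 * 60 * 60)


corpus_mb = 1.2 * 1000  # the 1.2 GB test collection, assuming 1 GB = 1000 MB
slow = indexing_seconds(corpus_mb, 10)   # at 10 MB/s
fast = indexing_seconds(corpus_mb, 15)   # at 15 MB/s
print(f"single-core indexing time: {fast:.0f}-{slow:.0f} s")

qps = average_qps(50_000_000)  # assumed: 50 million queries per day
print(f"average load: {qps:.0f} queries/sec")
```

Under these assumptions the test collection indexes in roughly 80 to 120 seconds on one core, and 50 million queries per day averages out to a little under 600 queries per second, before accounting for peak-hour traffic.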
Key Features
• Batch and Real-Time full-text indexes
• Non-text attributes support
• SQL database indexing
• Non-SQL storage indexing
• Easy application integration
• Advanced full-text searching syntax
• Rich database-like querying features
• Better relevance ranking
• Flexible text processing
• Distributed searching
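To make the first two items concrete, here is a toy in-memory inverted index with non-text attributes used as filters. It only illustrates the general idea of full-text indexing plus attribute filtering, not how Sphinx is implemented; the `TinyIndex` class, the sample documents and the `year` attribute are all invented for this sketch.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: term -> set of document ids, plus attributes."""

    def __init__(self):
        self.postings = defaultdict(set)   # full-text part
        self.attributes = {}               # non-text part, e.g. {"year": 2012}

    def add(self, doc_id, text, **attrs):
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.attributes[doc_id] = attrs

    def search(self, query, **filters):
        """Return sorted ids of documents containing every query term
        and matching every attribute filter exactly."""
        terms = query.lower().split()
        if not terms:
            return []
        # Use .get so that unknown terms do not create empty postings.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(
            d for d in hits
            if all(self.attributes[d].get(k) == v for k, v in filters.items())
        )


# Batch-index a few made-up documents, then query them.
index = TinyIndex()
index.add(1, "Sphinx full text search server", year=2012)
index.add(2, "Solr enterprise search platform", year=2012)
index.add(3, "Sphinx distributed search", year=2011)

print(index.search("sphinx search"))
print(index.search("sphinx search", year=2011))
```

The first query matches documents 1 and 3; adding the `year=2011` filter narrows it to document 3.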
http://lucene.apache.org/solr/
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.
Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.
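As a sketch of how a client might ask for some of these features over HTTP: `q`, `rows`, `wt`, `hl`, `hl.fl`, `facet` and `facet.field` are standard Solr request parameters, while the base URL, the helper name `build_solr_url` and the field names (`title`, `category`) are assumptions made up for this example.

```python
from urllib.parse import urlencode


def build_solr_url(base_url, query, facet_field=None, highlight_field=None, rows=10):
    """Build a Solr select URL requesting JSON output."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    if highlight_field:
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and field names; adjust to the actual deployment.
url = build_solr_url(
    "http://localhost:8983/solr/select",
    "title:search",
    facet_field="category",
    highlight_field="title",
)
print(url)
```

The resulting URL can be fetched with any HTTP client; nothing here contacts a server.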
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat.
Solr Features
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces - XML, JSON and HTTP
• Comprehensive HTML Administration Interfaces
• Server statistics exposed over JMX for monitoring
• Scalability - Efficient Replication to other Solr Search Servers
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
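To illustrate the JSON interface, the sketch below parses a Solr-style JSON response. The sample payload is invented for this example (it is not output from a real server); it follows the shape Solr uses by default, where `facet_fields` holds a flat list alternating value and count. The helper name `facet_counts` is hypothetical.

```python
import json

# Invented sample payload in the shape of a Solr JSON response.
SAMPLE_RESPONSE = """
{
  "response": {"numFound": 2, "start": 0, "docs": [
    {"id": "1", "title": "Sphinx full text search"},
    {"id": "2", "title": "Solr enterprise search"}
  ]},
  "facet_counts": {"facet_fields": {"category": ["search", 2, "database", 0]}}
}
"""


def facet_counts(payload, field):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = payload.get("facet_counts", {}).get("facet_fields", {}).get(field, [])
    return dict(zip(flat[0::2], flat[1::2]))


payload = json.loads(SAMPLE_RESPONSE)
titles = [doc["title"] for doc in payload["response"]["docs"]]
print(payload["response"]["numFound"], titles)
print(facet_counts(payload, "category"))
```

Because the interface is plain JSON over HTTP, the standard library is enough on the client side; no Solr-specific client is assumed.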
What is it all about?
Solr is based on Lucene
More about Lucene
