Search Engine Capabilities - Apache Solr(Lucene)


  2. AGENDA
     • About: search engines and their capabilities
     • Apache Solr/Lucene: introduction
     • Exploring Lucene
         • Features & capabilities
         • Library component stack
         • Architecture framework
         • Work-out
     • Exploring Solr
         • Features & capabilities
         • Architecture overview
     • Work-out: search capabilities with Solr
         • Pre-requisites & set up
         • Work-out search capabilities
     • What next: scope & future
  3. ABOUT - SEARCH ENGINE & CAPABILITIES
     A search engine is a tool that processes the input provided by the end user and locates matching information, documents or web pages in an index by applying a defined set of algorithms (crawling/spidering, indexing, querying, ranking, etc.).
     A search engine's capabilities vary with demand, context, content and information model. At the top level, search engines can be categorized as:
     • Multi-site page/full-text search
     • Single-site/document full-text search
     Beyond that, search engines can be further categorized as:
     • Crawler-based search engines
     • Social search engines
     • Directory search engines
     • Hybrid search engines
     • Specialty search engines
     • Paid/promotional inclusion search engines
     • Pay-per-click (sponsored results)
     • Open source search engines
  4. APACHE SOLR/LUCENE - INTRODUCTION
     Apache Lucene is a Java-based, high-performance, full-featured text search engine library.
     • Developed by Doug Cutting in 1999 and released under the Apache Software License
     • Document-oriented architecture
     • Widely recognized for its full-text indexing and searching capability
     • Fast indexing (up to 150 GB/hour) with low memory requirements (only 1 MB heap)
     • Flexible API, independent of file format (e.g. PDF, HTML, Word, OpenDocument)
     • Can be used for text/document search across local documents and the web
     Projects that extend or build on Lucene include Nutch, Solr, Elasticsearch, Compass and DocFetcher.
  5. APACHE SOLR/LUCENE - INTRODUCTION
     Apache Solr is a high-performance, Java-based enterprise search server platform built on top of Lucene. It provides distributed indexing, replication, load-balanced querying, and automated failover/recovery with centralized configuration.
     • Developed by Yonik Seeley in 2004 at CNET Networks and donated to Apache in 2006
     • Runs within a servlet container such as Tomcat or Jetty (the default)
     • Multi-core architecture: multiple cores can run in the same webapp
     • Well recognized for distributed search capabilities, e.g. cluster search
     • Open source and extensible via independent plug-ins, e.g. Carrot2
     • SolrCloud support for cloud-based applications (2012 edition)
  6. EXPLORING LUCENE - FEATURES & CAPABILITIES
     Lucene works on five key fundamentals:
     • Document
     • Field
     • Analyzer (tokenizers/filters, stop words, synonyms, multilingual support, ...)
     • Indexing (inverted index, encoding, segmentation, data compression, commit strategy)
     • Querying/searching (Lucene query model, evaluation, scoring, similarity, extensions)
     As a result, Lucene provides:
     • High-performance indexing (incremental or batch; the index is roughly 20-30% of the size of the text indexed)
     • Powerful, complex query processing, e.g. phrase, wildcard, proximity, range, facet and fuzzy queries
     • Fielded searching and sorting, e.g. by title, author, contents
     • Ranked searching (best results returned first)
     • Multiple-index searching with merged results
     • Simultaneous index update and searching
     • Flexible faceting, highlighting, joins and result grouping
     • Pluggable ranking models, including the Vector Space Model
     • Configurable storage engine (codecs)
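The analyzer, indexing and searching fundamentals listed on this slide can be illustrated with a toy inverted index. The following is a minimal sketch in plain Java, not Lucene code: the class name ToyInvertedIndex, its methods and the stop-word list are hypothetical, and real Lucene layers segments, compression and relevance scoring on top of this basic idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: illustrative only, not Lucene's implementation. */
final class ToyInvertedIndex {
    // Hypothetical stop-word list standing in for an analyzer's stop filter.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to", "be"));

    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Analysis step: lower-case, split on non-letters/digits, drop stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** Indexing step: records every term of the text and returns the new document id. */
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Search step: ids of the documents that contain every query term (AND semantics). */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : analyze(query)) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("This is the text to be indexed.");
        index.addDocument("Lucene is a text search library.");
        System.out.println(index.search("text"));
        System.out.println(index.search("search library"));
    }
}
```

Looking terms up in a term-to-documents map, rather than scanning every document, is what keeps query time largely independent of the amount of text indexed.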
  7. LUCENE - LIBRARY COMPONENT STACK
     [Diagram: Lucene library component stack]
     • Lucene Test Framework
     • Lucene Analyzer, Lucene Indexer
     • Modules: Spatial, Benchmark, Grouping, Suggest, Facet, Sandbox, Highlighter, Query Parser, Query, Misc, Join, Memory
     • Analyzers: ICU, Common, Phonetic, UIMA, Stempel, SmartCN, Kuromoji, Morfologik
     • Lucene Codec
     • Lucene Core: Search, Payload, Similarity, Store, Finite State Transducer, Compress, Util, Automaton, Document, Packed Int/Array, Span, Analysis, Index, Codec, Codec Per Field
  8. Exploring Lucene - Architecture Framework
     [Diagram: Lucene architecture framework]
     • Components: Directories, Codec, Index Writer, Index Reader, Query Parser, Text Analysis Chain, Doc Writer, Index Chain, Segment, Segment Reader, Scoring API, Collection, Collection Stat, TermsEnum, DocsAndPositionsEnum
     • Input: <doc> <field></field> <field></field> … </doc>; output: ranked result
     • Flows: add/update, per-field token stream, flush/commit, open/reopen, query, search segment, retrieve stored value
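  9. Exploring Lucene - Work out
     Setting your CLASSPATH
     Download and extract the Lucene distribution and put its jars (Lucene core, queryparser, analyzers-common) on your Java CLASSPATH. The snippets below use the Lucene 4.x API; assertEquals comes from JUnit.

     Indexing files
         import java.io.File;
         import org.apache.lucene.analysis.Analyzer;
         import org.apache.lucene.analysis.standard.StandardAnalyzer;
         import org.apache.lucene.document.*;
         import org.apache.lucene.index.*;
         import org.apache.lucene.queryparser.classic.QueryParser;
         import org.apache.lucene.search.*;
         import org.apache.lucene.store.*;
         import org.apache.lucene.util.Version;
         import static org.junit.Assert.assertEquals;

         // Tokenize and filter text with the standard analyzer
         Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
         // Keep the index on disk (FSDirectory.open takes a File in 4.x)
         Directory directory = FSDirectory.open(new File("/tmp/testindex"));
         IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
         IndexWriter iwriter = new IndexWriter(directory, config);
         // A document is a set of fields; this one field is both indexed and stored
         Document doc = new Document();
         String text = "This is the text to be indexed.";
         doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
         iwriter.addDocument(doc);
         iwriter.close();  // commits the pending changes

     Searching files
         DirectoryReader ireader = DirectoryReader.open(directory);
         IndexSearcher isearcher = new IndexSearcher(ireader);
         // Parse a simple query that searches "fieldname" for "text"
         QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
         Query query = parser.parse("text");
         // No filter, at most 1000 hits
         ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
         // Iterate through the results and load each stored document
         for (int i = 0; i < hits.length; i++) {
             Document hitDoc = isearcher.doc(hits[i].doc);
             assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
         }
         ireader.close();
         directory.close();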
  10. EXPLORING SOLR - FEATURES & CAPABILITIES
     Solr features (in addition to Lucene)
     • Caching
         • Document cache instances
         • User-level caching
         • Pluggable cache implementations
     • SolrCloud
         • Automated distributed indexing/sharding
         • Real-time indexing
         • Transaction log
         • Query failover and recovery
     • Additional
         • Ajax-based admin interface with a bundle of functionality: cache and logging management, monitoring, text analysis debugging, schema browser, web query output, SolrCloud dashboard, etc.
         • Rich document parsing and indexing (PDF, Word, HTML, etc.) using Apache Tika
         • Apache UIMA integration for configurable metadata extraction
     • Solr Core
         • Multi-core analysis and indices
         • Dynamically create/delete document collections
         • Pluggable query handlers
         • Extensible XML data format
         • Component-based request handler
         • Distributed search support
         • Uniqueness/duplicate document detection
         • Custom index processing chains
  11. SOLR ARCHITECTURE
     [Diagram: Solr architecture, layered on Apache Lucene]
     • Request handlers: /select, /spell, /admin, Data Import Handler (SQL/RSS), Extracting Request Handler (PDF/Word, via Apache Tika), Analysis Request Handler
     • Update handlers (XML, CSV, binary) and update processors (Signature, Logging, Indexing)
     • Response writers: XML, binary, JSON
     • Search components: Query, Spelling, Faceting, Highlighting, Debug, Statistics, More Like This, Distributed Search, Clustering
     • Core services: Query Parsing, Caching, Faceting, Filtering, Search, Highlighting, Index Replication
     • Configuration: SolrConfig and Schema, e.g. <fieldType name="text1"> <filter="whitespace"> <filter="customFilter" …> <filter="synonyms" file=..> <filter="porter" except=..> <field name="title" type="text1" <field name="cust1" class=
     • Lucene layer: Core Search (IndexReader/Searcher), Indexing (IndexWriter), Text Analysis
     • Example request: http://.../select?q=cheese&wt=xml
  12. APACHE SOLR - PRE-REQUISITES & SET UP
     Solr Component Set up Container
     • contrib: modules that provide extensions to Solr
         • analysis-extras: text analysis components for multilingual support, e.g. Chinese
         • clustering: engine for clustering search results
         • DataImportHandler (DIH): contrib module that imports data into Solr from other data sources
         • extraction: integration with Apache Tika (a framework for extracting text from common file formats, also used by DIH's TikaEntityProcessor)
         • uima: integration with Apache UIMA (a framework for extracting metadata out of text, identifying proper names in text and identifying the language)
         • velocity: simple search UI framework based on the Velocity templating language
     • dist: Solr distributable WAR and contrib jar files
  13. APACHE SOLR - PRE-REQUISITES & SET UP
     Solr Component Set up Container
     • example: contains a complete Solr server with the Jetty servlet engine, serving as a demo
         • example/etc contains Jetty's server configuration
         • example/exampledocs contains sample documents to be indexed into the default Solr configuration, along with post.jar for sending documents to Solr
         • example/solr is the default sample Solr configuration
         • example/webapps is the place Jetty expects to deploy Solr from
  14. QUERY EXAMPLES
     DisMax
         http://solr/select?qt=dismax&start=0&rows=2
           &q=super man                  // user query
           &qf=title^3 subject^2 body    // fields to query
           &pf=title^2,body              // fields to do phrase queries on
           &ps=100                       // slop for those phrase queries
           &tie=.1                       // multi-field match reward
           &mm=2                         // number of terms that should match
           &bf=popularity                // boost function
     Facet
         http://solr/select?q=foo&wt=json&indent=on&facet=true&facet.field=cat
           &facet.query=price:[0 TO 100]&facet.query=manu:IBM
     Filter
         &q=memory&fq=inStock:true&facet=true&…
     Highlighting
         http://solr/select?q=lcd&wt=json&indent=on&hl=true&hl.fl=features
     • Date range: releaseDate:[2000 TO 2007]
     • Wildcard: sup?r, su*r, super*
     • Fuzzy: based on Levenshtein distance, with an optional minimum similarity, e.g. spider~0.7
     • Boolean: (Superman AND "Lex Luthor") OR (+Batman +Joker)
     • Phrase queries need balanced quotes; '+' marks a required term, '-' a prohibited term
     • Function: log(sum(popularity,1))
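The URLs on this slide are written unescaped for readability; sent over HTTP, the spaces, carets, colons and brackets in the parameter values need URL encoding. The following is a minimal sketch of assembling such a request URL with only the Java standard library. The class SolrQueryUrl and its methods are hypothetical helpers, not part of Solr or SolrJ, and the base URL http://solr/select is simply the placeholder used on the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds an encoded Solr select URL; not a Solr API. */
final class SolrQueryUrl {
    private final StringBuilder url;
    private boolean first = true;

    SolrQueryUrl(String baseUrl) {
        this.url = new StringBuilder(baseUrl);
    }

    /** Appends one name=value pair; the same name may be added repeatedly (e.g. facet.query). */
    SolrQueryUrl param(String name, String value) {
        url.append(first ? '?' : '&');
        first = false;
        url.append(encode(name)).append('=').append(encode(value));
        return this;
    }

    String build() {
        return url.toString();
    }

    private static String encode(String s) {
        try {
            // Form encoding: spaces become '+', reserved characters become %XX
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // DisMax example from the slide
        System.out.println(new SolrQueryUrl("http://solr/select")
                .param("qt", "dismax")
                .param("q", "super man")
                .param("qf", "title^3 subject^2 body")
                .param("mm", "2")
                .build());
        // Facet example from the slide, with a repeated facet.query parameter
        System.out.println(new SolrQueryUrl("http://solr/select")
                .param("q", "foo")
                .param("facet", "true")
                .param("facet.field", "cat")
                .param("facet.query", "price:[0 TO 100]")
                .param("facet.query", "manu:IBM")
                .build());
    }
}
```

A builder that appends pairs in order, rather than a map, is used here because Solr parameters such as facet.query may legitimately appear more than once.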
  15. REFERENCES
     • http://lucene.apache.org/core/features.html
     • http://lucene.apache.org/solr/4_1_0/tutorial.html
     • http://en.wikipedia.org/wiki/Index_(search_engine)
     • http://wiki.apache.org/solr/SolrQuerySyntax
     • http://wiki.apache.org/solr/SolrFacetingOverview
     • http://horicky.blogspot.in/2013/02/text-processing-part-2-inverted-index.html
  16. THANK YOU

Editor's notes

  • Apache Nutch: provides web crawling and HTML parsing
    Apache Solr: an enterprise search server
    ElasticSearch: an enterprise search server
    Compass: a Java search engine framework
    DocFetcher: a multiplatform desktop search application
  • Accepts several types of queries:

    Term query
    (e.g., buffer edit)

    Phrase query
    (e.g., “buffer edit”)

    Boolean query
    (e.g., buffer AND edit OR modify)

    Wildcard query
    (e.g., te?t, test*, te*t)

    Range query
    (e.g., date:[20020101 TO 20030101])

    Fuzzy query
    - uses the Levenshtein distance between strings (e.g., roam~ searches for terms similar to roam, like "roam", "foam")

    Proximity query
    - finds terms within a specific distance of each other (e.g., "jakarta apache"~10 searches for "apache" and "jakarta" within 10 terms of each other in a document)
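
The fuzzy query described above rests on the Levenshtein (edit) distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The following is a minimal sketch of the standard dynamic-programming computation in plain Java. The class EditDistance is hypothetical, and the similarity normalization shown is only one simple convention chosen for illustration; it is an assumption, not necessarily the formula Lucene applies to values such as spider~0.7.

```java
/** Illustrative Levenshtein distance; hypothetical helper, not Lucene code. */
final class EditDistance {

    /** Minimum number of single-character inserts, deletes and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a yields the empty string
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization to [0, 1]: 1 minus distance over the longer length. */
    static double similarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longer;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("roam", "foam"));
        System.out.println(similarity("roam", "foam"));
    }
}
```

With this convention, "roam" and "foam" are one substitution apart, so a fuzzy search for roam~ would treat "foam" as a close match.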
  • RequestHandlers – handle a request at a URL like /select
    SearchComponents – part of a SearchHandler, a componentized request handler
    Includes Query, Facet, Highlight, Debug, Stats
    Distributed Search capable
    UpdateHandlers – handle an indexing request
    Update Processor Chains – per-handler componentized chain that handle updates
    Query Parser plugins
    Mix and match query types in a single request
    Function plugins for Function Query
    Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
    ResponseWriters serialize & stream response to client

    Each request handler can be mapped to a different URL
    SearchHandler is a componentized RequestHandler that allows search components to be chained together and also enables the framework for distributed search operations.
    Each SearchHandler can have its own custom set of search components, along with default or invariant parameters
    All of the configuration is declarative – including adding new request handlers or search components.
    The QueryResponse object is very generic and can handle returning any type of data
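
The idea of a request handler that chains search components and applies default parameters can be sketched in a few lines of plain Java. This is an illustration of the pattern only: the Component interface, the Handler class and the selectHandler factory below are hypothetical and do not reproduce Solr's actual SearchHandler or SearchComponent APIs, which are configured declaratively rather than in code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a componentized request handler; all names here are hypothetical. */
final class ComponentChainSketch {

    /** Stand-in for a search component: reads parameters, contributes to the response. */
    interface Component {
        void process(Map<String, String> params, Map<String, Object> response);
    }

    /** Stand-in for a handler: merges defaults with the request, then runs each component in order. */
    static final class Handler {
        private final Map<String, String> defaults;
        private final List<Component> components;

        Handler(Map<String, String> defaults, List<Component> components) {
            this.defaults = defaults;
            this.components = components;
        }

        Map<String, Object> handle(Map<String, String> requestParams) {
            Map<String, String> params = new LinkedHashMap<>(defaults);
            params.putAll(requestParams); // request values override the defaults
            Map<String, Object> response = new LinkedHashMap<>();
            for (Component component : components) {
                component.process(params, response);
            }
            return response;
        }
    }

    /** Builds a handler with a query component followed by a facet component. */
    static Handler selectHandler() {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("rows", "10");
        Component query = (params, response) -> {
            response.put("q", params.get("q"));
            response.put("rows", params.get("rows"));
        };
        Component facet = (params, response) -> {
            // Only contributes when faceting was requested
            if ("true".equals(params.get("facet"))) {
                response.put("facet.field", params.get("facet.field"));
            }
        };
        return new Handler(defaults, Arrays.asList(query, facet));
    }

    public static void main(String[] args) {
        Map<String, String> request = new LinkedHashMap<>();
        request.put("q", "cheese");
        request.put("facet", "true");
        request.put("facet.field", "cat");
        System.out.println(selectHandler().handle(request));
    }
}
```

Because every component writes into one generic response map, new components can be added to the chain without changing the handler, which mirrors the note above about the response object being able to carry any type of data.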