TriHUG: Lucene Solr Hadoop
1. Where It All Began: Using Apache Hadoop for Search with Apache Lucene and Solr
2. Topics
- Search
- What is: Apache Lucene? Apache Nutch? Apache Solr?
- Where does Hadoop (ecosystem) fit?
  - Indexing
  - Search
  - Other
3. Search 101
- Search tools are designed for dealing with fuzzy data
- Work well with structured and unstructured data
- Perform well when dealing with large volumes of data
- Many apps don't need the limits that databases place on content
- Search fits well alongside a DB too
- Given a user's information need (the query), find and, optionally, score content relevant to that need
- Many different ways to solve this problem, each with tradeoffs
- What does "relevant" mean?
4. Search 101: Relevance
- Indexing finds and maps terms and documents (toy example below)
  - Conceptually similar to a book index
  - At the heart of fast search/retrieve
- Vector Space Model (VSM) for relevance
  - Common across many search engines
- Apache Lucene is a highly optimized implementation of the VSM
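To make the book-index analogy concrete, here is a toy inverted index in Java. This is an illustration only (class and field names are invented), not Lucene's far more optimized on-disk structure: it simply maps each term to the documents containing it, which is what makes lookup fast.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyInvertedIndex {
  // term -> ids of documents containing that term
  private final Map<String, List<Integer>> index = new HashMap<>();

  public void add(int docId, String text) {
    for (String term : text.toLowerCase().split("\\W+")) {
      index.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }
  }

  public List<Integer> search(String term) {
    return index.getOrDefault(term.toLowerCase(), List.of());
  }

  public static void main(String[] args) {
    ToyInvertedIndex idx = new ToyInvertedIndex();
    idx.add(0, "Hadoop was born from Nutch");
    idx.add(1, "Lucene powers Solr");
    System.out.println(idx.search("lucene")); // prints [1]
  }
}
```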
5. Lucene
- A mature, high-performance Java API that provides search capabilities to applications
- Supports indexing, searching, and a number of other commonly used search features (highlighting, spell checking, etc.)
- Not a crawler, and doesn't know anything about Adobe PDF, MS Word, etc.
- Created in 1997 and now part of the Apache Software Foundation
- Important to note that Lucene does not have distributed index (shard) support
6. Nutch
- ASF project aimed at providing large-scale crawling, indexing, and searching using Lucene and other technologies
- Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch, based on the Google MapReduce paper by Dean and Ghemawat: http://labs.google.com/papers/mapreduce.html
- Only much later did it spin out to become the Hadoop that we all know
- In other words, Hadoop was born from the need to scale search crawling and indexing
- Originally used Lucene for search/indexing; now uses Solr
7. Solr
- Solr is the Lucene-based search server providing the infrastructure required for most users to work with Lucene
- Without knowing Java!
- Also provides:
  - Easy setup and configuration
  - Faceting
  - Highlighting
  - Replication/sharding
  - Lucene best practices
- http://search.lucidimagination.com
8. Lucene Basics
- Content is modeled via Documents and Fields
- Content can be text, integers, floats, dates, custom types
- Analysis can be employed to alter content before indexing
- Searches are supported through a wide range of Query options (sketch below):
  - Keyword
  - Terms
  - Phrases
  - Wildcards, other
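A minimal sketch of these basics against a recent Lucene release. The class names below are real Lucene API, but exact signatures vary across versions (e.g. the in-memory directory was RAMDirectory in the Lucene 3.x era of this talk); the field names and sample text are placeholders.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneBasics {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();          // in-memory index
    StandardAnalyzer analyzer = new StandardAnalyzer();  // analysis at index and query time

    // Indexing: content is modeled as Documents made of Fields
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new TextField("title", "Using Hadoop for Search", Field.Store.YES));
      writer.addDocument(doc);
    }

    // Searching: parse a keyword query against the "title" field
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query q = new QueryParser("title", analyzer).parse("hadoop");
      for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
        System.out.println(searcher.doc(hit.doc).get("title"));
      }
    }
  }
}
```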
9. Quick Solr Demo
Pre-reqs: Apache Ant 1.7.x, SVN
svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
cd solr-trunk/solr/
ant example
cd example
java -jar start.jar
cd exampledocs; java -jar post.jar *.xml
Then browse to http://localhost:8983/solr/browse
10. Anatomy of a Distributed Search System
[Diagram: users send queries to an application that fans out/in across Shard[0]..Shard[n] via a coordination layer of searchers; input docs flow through a sharding algorithm to per-shard indexers]
11. Sharding Algorithm
- Good document distribution across shards is important
- Simple approach: hash(id) % numShards (see the sketch below)
  - Fine if the number of shards doesn't change, or if reindexing is easy
- Better: consistent hashing: http://en.wikipedia.org/wiki/Consistent_hashing
- Also key: how to deal with the shape/size of the cluster changing
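A sketch of both approaches in Java. This is an illustration rather than any particular project's code; CRC32 stands in for whatever stronger hash function (MD5, murmur, etc.) you would actually use.

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class Sharding {
  // Simple approach: stable only while numShards stays fixed
  static int modShard(String docId, int numShards) {
    return Math.floorMod(docId.hashCode(), numShards);
  }

  // Consistent hashing sketch: each shard owns many points on a ring;
  // a doc belongs to the first shard point at or after its own hash.
  static final SortedMap<Long, String> ring = new TreeMap<>();

  static long hash(String s) {
    CRC32 crc = new CRC32();                          // illustrative hash only
    crc.update(s.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }

  static void addShard(String shard, int virtualNodes) {
    for (int i = 0; i < virtualNodes; i++) ring.put(hash(shard + "#" + i), shard);
  }

  static String shardFor(String docId) {
    SortedMap<Long, String> tail = ring.tailMap(hash(docId));
    return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
  }

  public static void main(String[] args) {
    addShard("shard0", 64);
    addShard("shard1", 64);
    System.out.println(shardFor("doc-42"));
  }
}
```

Because each shard owns many points on the ring, adding or removing a shard remaps only the keys near its points, roughly 1/numShards of the corpus, instead of reshuffling nearly everything the way the modulo scheme does.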
12. Hadoop and Search
- Much of the Hadoop ecosystem is useful for search-related functionality
- Indexing: the process of adding documents to an inverted index to make them searchable
  - In most cases batch-oriented and embarrassingly parallel, so Hadoop Core can help
- Search: querying the index and returning documents and other info (facets, etc.) related to the result set
  - Sub-second response time is usually required
  - ZooKeeper, Avro, and others are still useful here
13. Indexing (Lucene)
- Hadoop ships with contrib/index
  - Almost no documentation, but a good example of map-side indexing (sketched below)
  - The mapper does analysis and creates an in-memory index, which is written out as segments
  - Indexes are merged on the reduce side
- Katta (http://katta.sourceforge.net): shard management, distributed search, etc.
- Both give you a large amount of control, but you have to build out all the search framework around them
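A rough sketch of the map-side pattern. This illustrates the idea rather than reproducing the actual contrib/index source; the field names are placeholders, and a real job would write the flushed segment to the task's output for reduce-side merging.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class IndexingMapper extends Mapper<LongWritable, Text, Text, Text> {
  private ByteBuffersDirectory ramDir;
  private IndexWriter writer;

  @Override
  protected void setup(Context ctx) throws IOException {
    ramDir = new ByteBuffersDirectory();  // per-task in-memory index
    writer = new IndexWriter(ramDir, new IndexWriterConfig(new StandardAnalyzer()));
  }

  @Override
  protected void map(LongWritable key, Text line, Context ctx) throws IOException {
    Document doc = new Document();        // analysis happens here, in the mapper
    doc.add(new TextField("body", line.toString(), Field.Store.NO));
    writer.addDocument(doc);
  }

  @Override
  protected void cleanup(Context ctx) throws IOException {
    writer.close();  // flush the in-memory segment; a real job would now copy it
                     // out so the reduce side can merge segments into shards
  }
}
```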
14. Indexing (Solr)
- https://issues.apache.org/jira/browse/SOLR-1301
  - Map-side formats, reduce-side indexing
  - Creates indexes on the local file system (outside of HDFS) and copies them to the default FS (HDFS, etc.)
  - Manually install the index into a Solr core once built
- https://issues.apache.org/jira/browse/SOLR-1045
  - Map-side indexing; incomplete, but based on Hadoop contrib/index
- Write a distributed update handler to handle distribution on the server side (plain client-side indexing sketched below for contrast)
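For contrast, neither patch is needed just to get documents into a single Solr instance. Here is a plain SolrJ sketch of client-side indexing; the HttpSolrClient builder API is from later SolrJ releases than this talk, and the URL and field names are placeholders.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexer {
  public static void main(String[] args) throws Exception {
    // URL is a placeholder; point it at your Solr core
    try (SolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      doc.addField("title", "Using Hadoop for Search");
      solr.add(doc);    // queue the document
      solr.commit();    // make it visible to searchers
    }
  }
}
```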
15. Indexing (Nutch to Solr)
- Use Nutch to crawl content, Solr to index and serve
- Doesn't support indexing to Solr shards just yet
  - Need to write/use a Solr distributed update handler
- Still useful for smaller crawls (< 100M pages)
- http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
16. Searching
- Hadoop Core is not all that useful for distributed search
  - Exception: the Hadoop RPC layer, possibly
  - Exception: log analysis, etc. for search-related items
- Other Hadoop ecosystem tools are useful:
  - Apache ZooKeeper (more in a moment)
  - HDFS: storage of shards (pull down to local disk)
  - Avro, Thrift, Protocol Buffers (serialization utilities)
17. ZooKeeper and Search
- ZooKeeper is a centralized service for coordination, configuration, naming, and distributed synchronization
- In the context of search, it's useful for:
  - Sharing configuration across nodes
  - Maintaining status about shards: up/down/latency/rebalancing and more
  - Coordinating searches across shards / load balancing (sketch below)
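A minimal ZooKeeper sketch of the shard-status idea. This is an illustration, not Katta's or SolrCloud's actual layout; the connection string, znode paths, and payload are invented.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ShardStatus {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});

    // Ensure the parent paths exist (persistent nodes)
    if (zk.exists("/search", false) == null)
      zk.create("/search", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    if (zk.exists("/search/shards", false) == null)
      zk.create("/search/shards", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Register this shard; EPHEMERAL means the znode vanishes when the session dies
    zk.create("/search/shards/shard0",
              "host1:8983 status=up".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE,
              CreateMode.EPHEMERAL);

    // A coordinator can list live shards (and set a watch for changes)
    for (String shard : zk.getChildren("/search/shards", true)) {
      System.out.println("live shard: " + shard);
    }
    Thread.sleep(Long.MAX_VALUE); // keep the session (and the ephemeral node) alive
  }
}
```

The ephemeral node is the key design point: shard liveness falls out of ZooKeeper's session handling, so searchers learn about dead nodes without a separate heartbeat protocol.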
18. ZooKeeper and Search (Practical)
- Katta employs ZooKeeper for search coordination: query distribution, status, etc.
- SolrCloud: all the benefits of Solr, plus ZooKeeper for coordinating distributed capabilities
  - Query distribution, configuration sharing, status, etc.
  - About to be committed to Solr trunk
- http://wiki.apache.org/solr/SolrCloud
19. Other Search-Related Tasks
- Log analysis: query analytics, related searches, relevance assessments
- Classification and clustering: Mahout (http://mahout.apache.org)
- HBase and other stores for documents
- Avro, Thrift, Protocol Buffers for serialization of objects across the wire