Solr Masterclass Bangkok, June 2014

2,981 views

Presentation given to a mostly Thai audience in Bangkok, June 2014.

Published in: Internet, Technology

  1. Apache Solr Masterclass: From zero to hero. June 2014. www.slideshare.net/arafalov/solr-masterclass-bangkok-june-2014
  2. Alexandre Rafalovitch, www.outerthoughts.com
  3. Web search engines are quite sophisticated
  4. (no text on this slide)
  5. But the real search needs are much DEEPER and BROADER
  6. Searching code
  7. Searching people and companies
  8. Searching products
  9. Searching library material
  10. Searching languages
  11. Understanding full-text search
      SELECT * FROM database WHERE field LIKE '%word%'
      This DOES NOT scale. Instead:
      - break text into tokens
      - domain-specific processing (e.g. lower-casing)
      - build fast-access structures
      - algorithms for term, phrase, and proximity search
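The steps on this slide (tokenize, normalize, build a fast-access structure) can be sketched as a toy inverted index. This is only an illustration of the idea, not how Lucene is implemented; the function names and the sample documents are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break text into word tokens and lower-case them."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Solr is a search server",
    2: "Lucene is a search library",
    3: "Databases store rows",
}
index = build_index(docs)
print(sorted(search(index, "SEARCH")))          # [1, 2]
print(sorted(search(index, "search library")))  # [2]
```

A lookup touches only the posting sets of the query tokens, which is why this scales where a LIKE '%word%' scan over every row does not.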
  12. Basic search engine features
      - Search (Duh!): keyword, phrase, field-specific
      - Positive and negative terms
      - Sort: relevancy, recency
      - Pagination
      - Compact summary in results
      - SPEED
  13. Advanced search engine features
      - Facets/taxonomy-based navigation with live counts
      - Language-specific processing
      - Domain-specific text processing (WiFi = Wi-Fi = WIFI)
      - Geographic search
      - More-like-this, did-you-mean, autocomplete
      - Scaling/clustering
      - NOT web crawling: different, but related
  14. Search engine solutions?
      Solr, Elasticsearch, Xapian, Sphinx, Groonga, Searchdaimon, {F}lexSearch, Algolia (SaaS), Searchify (SaaS), ForageJS, Lunr.js, FACT-Finder, DtSearch, MarkLogic, Verity, Fast, most databases ... AND MORE
  15. Open Source Search Evolution (used with permission from SemaText)
  16. Secret Ingredient - Lucene
      Solr, Elasticsearch, SwiftType, Galene (LinkedIn's), PyLucene (Python wrapper), Lucene.net (C# port)
      - Scalable, high-performance indexing
      - Incremental indexing
      - Full-text search
      - Information-retrieval algorithms
      - Implemented in Java
      - Written in 1999, still going strong
  17. Secret Ingredient - Solr
      - Certified distributions: LucidWorks, HelioSearch
      - Big Data platforms: Cloudera, Hortonworks HDP
      - Hosted and SaaS: Amazon CloudSearch, WebSolr, SolrHQ, SearchBox
      - Lucene full-text search
      - XML and REST config
      - Schema/schemaless
      - SolrCloud (clustering)
      - Caching
      - Near real-time
      - Rich-document indexing (Tika inside)
      - Plugins, components, processors
  18. Solr Ecosystem sample
      Drupal, Project Blacklight, LuxDB, SolrMeter, CrafterCMS, Typo3, Magento, HippoCMS, ColdFusion, SolrNet, DataStax, Dovecot, NGData Lily, Basho Riak, YaCy, Apache ManifoldCF, Apache Camel, Franz AllegroGraph, BitNami Solr Stack, Carrot2, Broadleaf Commerce, Cloudera CDK, CodeLibs Fess (フェス), Splunk, Alfresco, Rosette by BasisTech, Luwak by Flax, Quepid by OSC, TwigKit, SPM by SemaText, SILK by LucidWorks, Banana (O/S Solr Kibana)
  19. DEMO Time
  20. DEMO - Basic
      - Unzip
      - Go to example directory
      - Run Solr
      - Import some documents from example docs: grep -l store *.xml | xargs ./post.sh
      - Show off Solr 4 admin panel
  21. DEMO - Browse handler
      - Restart Solr with -Dsolr.clustering.enabled=true
      - Visit http://localhost:8983/solr/browse/
      - Show off: search; facets (categories and ranges); spatial/geo-distance; clusters
  22. Getting into Solr
  23. Start for free
      - Download, unzip, cd example; java -jar start.jar
      - Go through the basic tutorial in docs/tutorial.html
      - Copy example directory, modify schema.xml until happy
      - If coming from Elasticsearch, look at example-schemaless
      - Do NOT follow this path to production
      - Example schema is a kitchen sink! Read it as a story.
      - <solr>/example/solr/collection1/conf/{schema.xml|solrconfig.xml}
  24. Simplest Solr - directory layout
      solr-home - point here with -Dsolr.solr.home
        collection1 - default collection name, without solr.xml
          conf - configuration directory for the collection
            schema.xml - defines fields and types
            solrconfig.xml - defines low-level configuration but also components, handlers, and chains for UpdateRequestProcessor
  25. Simplest Solr - schema.xml
      <?xml version="1.0" encoding="UTF-8" ?>
      <schema version="1.5" name="simplest-solr">
        <fieldType name="string" class="solr.StrField"/>
        <field name="id" type="string" indexed="true" stored="true" required="true"/>
        <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
        <uniqueKey>id</uniqueKey>
      </schema>
  26. Simplest Solr - solrconfig.xml
      <?xml version="1.0" encoding="UTF-8" ?>
      <config>
        <luceneMatchVersion>LUCENE_4_9</luceneMatchVersion>
        <requestDispatcher handleSelect="false">
          <httpCaching never304="true" />
        </requestDispatcher>
        <requestHandler name="/select" class="solr.SearchHandler" />
        <requestHandler name="/update" class="solr.UpdateRequestHandler" />
        <requestHandler name="/admin" class="solr.admin.AdminHandlers" />
        <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" />
      </config>
  27. DEMO
      - https://github.com/arafalov/simplest-solr-config
      - java -Dsolr.solr.home=…./simplest-solr
      - Go to <solr>/example/exampledocs
      - grep -l store *.xml | xargs ./post.sh (same, same)
      - Check Admin UI
      - Query - same, but different (multivalue, date)
      - Schema browser
  28. Lots of things missing
      - Some admin UI items disabled (Ping, Files)
      - No Near-Real-Time or atomic/partial update
      - No types (apart from String)
      - No dynamic schema
      - No SolrCloud
      - DOES NOT MATTER. NOT YET!
  29. Two ways of learning
      - You can follow a path (going forward): a tutorial; a book; learn what it teaches
      - You can reach for the goal (going backwards): have an idea; try to achieve it; learn what's on the critical path
      - Both are valuable. The second is harder, but gives you more.
  30. Goal-driven Solr
      1. Start with the simplest configuration that works
      2. Get something in (import data)
      3. Get something out (display data)
      4. Celebrate!
      5. Decide/fine-tune what/how you want to find things
      6. Change the schema to match
      7. Change the import/display to match
      8. GOTO 5 (never really stops)
  31. Getting data in
      - curl
      - post.jar (in example/exampledocs); try "java -jar post.jar -h" for help
      - Admin UI (core/Documents)
      - Clients (SolrJ, among 33 at various levels of support: https://leanpub.com/solr-clients/)
      - Formats: XML, JSON, CSV, other formats (processed with Tika)
      - DataImportHandler to pull data from external sources
      - BigData connectors (Hadoop, Flume, etc.)
      - BigData integrations (DataStax for Solr on Cassandra, Cloudera for Solr on HDFS)
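As one possible way to push JSON documents in without curl, the sketch below builds an HTTP POST for the /update handler using only the Python standard library. The URL (local Solr on port 8983, core named collection1), the commit=true parameter placement, and the sample document are assumptions for illustration; the request is constructed but deliberately not sent.

```python
import json
import urllib.request

# Assumed location of a local Solr 4.x core; adjust to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST carrying a JSON array of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical document; field names other than "id" are made up.
docs = [{"id": "book-1", "title": "Solr Masterclass"}]
request = build_update_request(docs)
print(request.get_method(), request.full_url)

# To actually send it (needs a running Solr):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```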
  32. Getting data out
      - curl
      - Web browser
      - Admin UI (core/Query)
      - Clients (ResponseWriters for JSON, XML, Python, Ruby, PHP, CSV)
      - UI toolkits (Cloudera HUE, TwigKit)
      - Internal post-processors (we saw VelocityResponseWriter at /browse)
      - Needs middleware or strong proxy - not secure otherwise
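Getting data out with a browser or curl comes down to a GET against the /select handler. A minimal sketch of building such a URL, with the response writer and pagination expressed as parameters, might look like this. The base URL and the query string are assumptions; q, wt, rows and start are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr 4.x core; adjust to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_select_url(query, page=1, rows=10, base=SOLR_SELECT_URL):
    """Return a /select URL asking for one page of JSON results."""
    params = {
        "q": query,                  # what to search for
        "wt": "json",                # response writer: JSON
        "rows": rows,                # page size
        "start": (page - 1) * rows,  # offset of the first result
    }
    return base + "?" + urlencode(params)


url = build_select_url("ipod", page=3, rows=5)
print(url)
```

As the slide warns, a URL like this should be issued from middleware or through a restrictive proxy, not exposed directly to end users.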
  33. Celebrate!
      - You achieved a basic end-to-end test
      - You got Solr running
      - You figured out how to display it
      - You now know where the issues are
      - FIX THOSE NEXT
  34. Fine-tune schema
      - Solr is not friends with your data, it's here to get your documents found.
      - <field name="features" stored="true" indexed="true" type="text_general" multiValued="true"/>
      - stored=true - that's for you
      - indexed=true - that's for Solr, where the magic happens
      - type="type_name" - defines what analyser chain to use. See Admin UI core/Analysis
      - See http://www.solr-start.com/info/analyzers/ for full list
  35. Analyzers - English
      <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EnglishPossessiveFilterFactory"/>
          <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
          <filter class="solr.PorterStemFilterFactory"/>….
        </analyzer>….
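To make the idea of an analyzer chain concrete, here is a rough Python imitation of the pipeline above: a tokenizer followed by filters applied in order. It is a conceptual sketch only. The word lists are tiny stand-ins for stopwords_en.txt and protwords.txt, and the "stemmer" merely strips a trailing "s", which is far cruder than the Porter algorithm.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "is"}  # stand-in for lang/stopwords_en.txt
PROTECTED = {"news"}                               # stand-in for protwords.txt


def tokenize(text):
    """Split text into word tokens, keeping apostrophes inside words."""
    return re.findall(r"[A-Za-z0-9']+", text)


def remove_stopwords(tokens):
    """Drop stop words, ignoring case (like ignoreCase="true")."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def lowercase(tokens):
    return [t.lower() for t in tokens]


def strip_possessive(tokens):
    """Remove a trailing 's from each token."""
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def crude_stem(tokens):
    """Strip a plural 's' unless the word is protected (NOT real Porter stemming)."""
    stemmed = []
    for t in tokens:
        if t not in PROTECTED and len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
            t = t[:-1]
        stemmed.append(t)
    return stemmed


def analyze(text):
    """Run the tokenizer, then each filter in the same order as the XML chain."""
    tokens = tokenize(text)
    for step in (remove_stopwords, lowercase, strip_possessive, crude_stem):
        tokens = step(tokens)
    return tokens


print(analyze("The Author's books and the News"))  # ['author', 'book', 'news']
```

The order matters: each filter sees the output of the previous one, which is the same property the Admin UI Analysis screen lets you inspect step by step.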
  36. Analyzers - Persian
      <fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <charFilter class="solr.PersianCharFilterFactory"/>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.ArabicNormalizationFilterFactory"/>
          <filter class="solr.PersianNormalizationFilterFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt" />
        </analyzer>
      </fieldType>
  37. copyField FTW
      <copyField source="cat" dest="text"/>
      <copyField source="*_t" dest="text" maxChars="3000"/>
      Indexing book authors "Schildt, Herbert; Wolpert, Lewis; Davies, P."
      - For searching: tokenized, case-folded, punctuation-stripped: schildt / herbert / wolpert / lewis / davies / p
      - For sorting: untokenized, case-folded, punctuation-stripped: schildt herbert wolpert lewis davies p
      - For faceting: primary author only, using a solr.StrField: Schildt, Herbert
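The three representations of the author string above can be reproduced in a few lines of Python. This only mimics the outcome of copying one source value into three differently analyzed fields; in Solr the work would be done by copyField plus field types, and the function names here are invented for the example.

```python
import re


def search_tokens(authors):
    """Tokenized, case-folded, punctuation-stripped: one token per word."""
    return re.findall(r"\w+", authors.lower())


def sort_key(authors):
    """Untokenized, case-folded, punctuation-stripped: a single string."""
    return " ".join(search_tokens(authors))


def primary_author(authors):
    """Primary author only, kept verbatim for faceting."""
    return authors.split(";")[0].strip()


raw = "Schildt, Herbert; Wolpert, Lewis; Davies, P."
print(search_tokens(raw))   # ['schildt', 'herbert', 'wolpert', 'lewis', 'davies', 'p']
print(sort_key(raw))        # schildt herbert wolpert lewis davies p
print(primary_author(raw))  # Schildt, Herbert
```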
  38. Fine-tune search
      - Default query parser supports Lucene search syntax: text +compulsory -negated field:value
      - uses default field or explicit field
      - not very good for complex analysis
      - eDisMax supports that plus searching across many fields
      - Many more specialised types: https://cwiki.apache.org/confluence/display/solr/Other+Parsers
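The difference between the two parsers shows up mostly in the request parameters. The sketch below builds both parameter sets: the default parser with a default field (df), and eDisMax (defType=edismax) with a list of query fields (qf). The field names text, name, features and cat are assumptions borrowed from the stock example schema; substitute your own.

```python
from urllib.parse import urlencode


def lucene_params(query, default_field):
    """Default parser: Lucene syntax, one default field for bare terms."""
    return {"q": query, "df": default_field, "wt": "json"}


def edismax_params(user_text, fields):
    """eDisMax: search the same text across several (optionally boosted) fields."""
    return {"q": user_text, "defType": "edismax", "qf": " ".join(fields), "wt": "json"}


lucene = lucene_params("solr +search -database cat:book", "text")
edismax = edismax_params("solr search", ["name^2", "features", "cat"])

# Query strings ready to append to .../select?
print(urlencode(lucene))
print(urlencode(edismax))
```

Here name^2 asks eDisMax to weight matches in the name field twice as heavily as matches in the other fields.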
  39. Fine-tune indexing
      - UpdateRequestProcessor: after you send your data to Solr, before it hits the schema
      - Deal with missing values, do pre-processing, identify languages; the secret to schemaless mode (see example-schemaless)
      - Defined in solrconfig.xml, search for updateRequestProcessorChain
      - Full list at: http://www.solr-start.com/info/update-request-processors/
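Conceptually, an update request processor chain is a list of small steps that each take a document and hand a possibly modified document to the next step. The Python below models only that idea; it is not Solr's Java API. The three steps loosely mirror what trimming, blank-removal and default-value processors do, and the "language" field and its default are hypothetical.

```python
def trim_fields(doc):
    """Strip surrounding whitespace from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}


def remove_blank_fields(doc):
    """Drop fields whose value is an empty string."""
    return {k: v for k, v in doc.items() if v != ""}


def default_language(doc):
    """Add a default for a missing field (hypothetical 'language' field)."""
    return {"language": "en", **doc}


def run_chain(doc, chain):
    """Pass the document through each processor in order."""
    for processor in chain:
        doc = processor(doc)
    return doc


CHAIN = [trim_fields, remove_blank_fields, default_language]
incoming = {"id": " 42 ", "title": "  Solr  ", "subtitle": "   "}
processed = run_chain(incoming, CHAIN)
print(processed)
```

Reordering the chain changes the result: removing blanks before trimming would leave the whitespace-only subtitle in place as an empty string.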
  40. Fine-tune display
      - Sorting
      - Faceting - automatic taxonomy with counts (indexed value)
      - Highlighting
      - MoreLikeThis
      - Statistics
      - Grouping, Pivoting
      - Debug for troubleshooting
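For faceting, a request typically adds facet=true and one or more facet.field parameters, and the default JSON response returns each field's counts as a flat list of alternating values and counts. A small sketch of building the parameters and unpacking such a list follows. The cat field and the sample response are invented for illustration, not real output.

```python
from urllib.parse import urlencode


def facet_query_string(query, field):
    """Ask for facet counts only (rows=0) on a single field."""
    return urlencode({
        "q": query,
        "wt": "json",
        "rows": 0,
        "facet": "true",
        "facet.field": field,
    })


def pair_up(flat):
    """Turn [value, count, value, count, ...] into (value, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))


# Made-up response fragment in the flat layout described above.
sample_response = {
    "facet_counts": {
        "facet_fields": {"cat": ["electronics", 12, "currency", 4, "memory", 3]}
    }
}

query_string = facet_query_string("*:*", "cat")
pairs = pair_up(sample_response["facet_counts"]["facet_fields"]["cat"])
for value, count in pairs:
    print(value, count)
```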
  41. Documentation
      - Solr Wiki - old but still has a lot of information
      - Solr Reference Guide - new; online and downloadable
      - http://www.solr-start.com/ - my resources for learners
      - http://heliosearch.org/author/joel-bernstein/ - about new features
  42. With Solr, how far can I go?
      - Cloudera (BigData) has > $1,000,000,000 USD in investments - opportunities?
      - 8M+ searches/day, 40 languages, 100ms NRT, 1024 cores, 256 shards, 32 servers on #solr at Bloomberg http://bit.ly/1jmG72G (via @FlaxSearch)
  43. Hackathon
  44. First steps
      - Install Solr 4.9
      - Go through the tutorial - gives you the basics and an end-to-end test
      - Join the Slack chat (invitations are coming)
      - Tweet #SolrMasterclassBkk, @SolrStart, if you have space :-)
      - Attend breakout sessions
      - Choose your own adventure (next)
  45. Path 1 - Solr indexing book
      - Great for first timers
      - Gets you from zero to comfortable
      - All examples are provided
      - If you are stuck, I will help you
      - Probably will not win you any prizes...
      - Do it for the skills
  46. Path 2 - Your own dataset
      - Get it in at any cost
      - Get it displayed
      - Start iterating
      - Book a time slot to discuss your questions
      - Demo tips: explain the problem domain (what is your dataset); show how far you got; discuss the challenges
  47. Path 3 - Need a dataset
      - Index your favourite Git repository (e.g. Solr): https://github.com/arafalov/git-to-solr
      - Your own WordPress blog export (with DataImportHandler)
      - Your own hard drive
      - Demo tips: how far did you get; concentrate on displaying something cool (statistics?); coolest Solr feature you found
  48. Path 4 - A bigger challenge
      - Project Gutenberg (ask me for a copy of the RDF dump)
      - World Cup match data: http://worldcup.sfg.io/
      - Twitter feed (e.g. with Spring XD/Integration)
      - Your own photograph collection (Tika extracts metadata)
  49. DEMO Rules
      - There are no rules
      - And the prizes are not terribly important
      - What we are looking for is learning
      - Make something new out of something old
      - Learn a new feature and show others
      - Learn, teach, share - everybody wins
  50. For later
  51. Accelerate your learning
      - If you still feel like a beginner, buy my book - seriously. That's what it's for
      - All code/data is at: https://github.com/arafalov/solr-indexing-book
      - Buy Solr in Action - recently released and a great reference; follow @ManningBooks for discounts
      - Use my www.solr-start.com resources and join the mailing list (I'll do that for you this time)
      - Join the solr-user mailing list - full of advanced hackers
      - Watch Lucene Revolution videos for background
      - Start helping out on Stack Overflow #solr
      - Blog what you learned, tweet with #Solr
  52. Other search-related books
      - Designing the Search Experience: The Information Architecture of Discovery - by a TwigKit creator +1
      - Search Analytics for Your Site: Conversations with Your Customers by Louis Rosenfeld - see also Quepid
      - Enterprise Search by Martin White
  53. Alexandre Rafalovitch, www.outerthoughts.com
