Apache Solr
Masterclass
From zero to hero
June 2014
www.slideshare.net/arafalov/solr-masterclass-bangkok-june-2014
Alexandre Rafalovitch
www.outerthoughts.com
Web search engines are quite sophisticated

But the real search needs are much DEEPER and BROADER

Searching code
Searching people and companies
Searching products
Searching library material
Searching languages

Understanding full-text search
SELECT * FROM database WHERE field LIKE '%word%'
This DOES NOT scale
Instead:
  break text into tokens
  domain-specific processing (e.g. lower-casing)
  build fast-access structures
  algorithms for term, phrase, and proximity search

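The four steps on the full-text search slide can be sketched as a toy inverted index. This is only an illustration of the idea, not how Lucene implements it: the function names, the regex tokenizer, and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break text into tokens; lower-casing stands in for domain-specific processing."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Fast-access structure: token -> {doc_id: [positions of the token]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(position)
    return index


def term_search(index, term):
    """Return the ids of documents containing the term."""
    return set(index.get(term.lower(), {}))


def phrase_search(index, phrase):
    """Return the ids of documents where the phrase tokens appear consecutively."""
    tokens = tokenize(phrase)
    if not tokens:
        return set()
    hits = set()
    for doc_id, starts in index.get(tokens[0], {}).items():
        for start in starts:
            if all(start + offset in index.get(token, {}).get(doc_id, [])
                   for offset, token in enumerate(tokens)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample documents
docs = {
    1: "Solr is a search engine",
    2: "Search the engine room",
    3: "Full-text search engine basics",
}
index = build_index(docs)
```

A lookup is now a dictionary access instead of a scan over every row, which is the reason LIKE '%word%' does not scale and an index does. Storing positions is what makes phrase (and, with a distance check, proximity) search possible.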
Basic search engine features
Search (Duh!): keyword, phrase, field-specific
Positive and negative terms
Sort: relevancy, recency
Pagination
Compact summary in results
SPEED

Advanced search engine features
Facets/Taxonomy - based navigation with live counts
Language-specific processing
Domain-specific text processing (WiFi = Wi-Fi = WIFI)
Geographic search
More-like-this, did-you-mean, autocomplete
Scaling/Clustering
NOT web crawling - different, but related

Search engine solutions?
Solr
Elasticsearch
Xapian
Sphinx
Groonga
Searchdaimon
{F}lexSearch
Algolia (SaaS)
Searchify (SaaS)
ForageJS
Lunr.js
FACT-Finder
DtSearch
MarkLogic
Verity
Fast
Most databases
…AND MORE

Open Source Search Evolution
(diagram used with permission from SemaText)

Secret Ingredient - Lucene
Built on or derived from Lucene:
  Solr
  Elasticsearch
  SwiftType
  Galene (LinkedIn's)
  PyLucene (Python wrapper)
  Lucene.net (C# port)
What Lucene provides:
  Scalable, high-performance indexing
  Incremental indexing
  Full-text search
  Information-Retrieval algorithms
  Implemented in Java
  Written in 1999, still going strong

Secret Ingredient - Solr
Distributions and services:
  Certified distributions: LucidWorks, HelioSearch
  Big Data platforms: Cloudera, Hortonworks HDP
  Hosted and SaaS: Amazon CloudSearch, WebSolr, SolrHQ, SearchBox
Features:
  Lucene full-text search
  XML and REST config
  Schema/Schemaless
  SolrCloud (clustering)
  Caching
  Near real-time
  Rich-document indexing (Tika inside)
  Plugins, components, processors

Solr Ecosystem sample
Drupal
Project Blacklight
LuxDB
SolrMeter
CrafterCMS
Typo3
Magento
HippoCMS
ColdFusion
SolrNet
DataStax
Dovecot
NGData Lily
Basho Riak
YaCy
Apache ManifoldCF
Apache Camel
Franz AllegroGraph
BitNami Solr Stack
Carrot2
Broadleaf Commerce
Cloudera CDK
CodeLibs Fess (フェス)
Splunk
Alfresco
Rosette by BasisTech
Luwak by Flax
Quepid by OSC
TwigKit
SPM by SemaText
SILK by LucidWorks
Banana (open-source Kibana for Solr)

DEMO Time

DEMO - Basic
Unzip
Go to example directory
Run Solr
Import some documents from example docs:
  grep -l store *.xml | xargs ./post.sh
Show off Solr 4 admin panel

DEMO - Browse handler
Restart Solr with -Dsolr.clustering.enabled=true
Visit http://localhost:8983/solr/browse/
Show off:
  Search
  Facets - Categories and Ranges
  Spatial/Geo-distance
  Clusters

Getting into Solr

Start for free
Download, unzip, cd example; java -jar start.jar
Go through the basic tutorial in docs/tutorial.html
Copy the example directory, modify schema.xml until happy
If coming from Elasticsearch, look at example-schemaless
Do NOT follow this path to production
The example schema is a kitchen sink!!! Read it as a story.
<solr>/example/solr/collection1/conf/{schema.xml|solrconfig.xml}

Simplest Solr - directory layout
solr-home - point here with -Dsolr.solr.home
  collection1 - default collection name, without solr.xml
    conf - configuration directory for the collection
      schema.xml - defines fields and types
      solrconfig.xml - defines low-level configuration but also components, handlers, and chains for UpdateRequestProcessor

Simplest Solr - schema.xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema version="1.5" name="simplest-solr">
  <fieldType name="string" class="solr.StrField"/>

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>

  <uniqueKey>id</uniqueKey>
</schema>

Simplest Solr - solrconfig.xml
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>LUCENE_4_9</luceneMatchVersion>
  <requestDispatcher handleSelect="false">
    <httpCaching never304="true" />
  </requestDispatcher>
  <requestHandler name="/select" class="solr.SearchHandler" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin" class="solr.admin.AdminHandlers" />
  <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" />
</config>

DEMO
https://github.com/arafalov/simplest-solr-config
java -Dsolr.solr.home=…./simplest-solr
Go to <solr>/example/exampledocs
grep -l store *.xml | xargs ./post.sh (same, same)
Check Admin UI
Query - same, but different (multivalue, date)
Schema browser

Lots of things missing
Some admin UI items disabled (Ping, Files)
No Near-Real-Time or atomic/partial update
No types (apart from String)
No dynamic schema
No SolrCloud
DOES NOT MATTER. NOT YET!

Two ways of learning
You can follow a path (going forward)
  A tutorial
  A book
  Learn what it teaches
You can reach for the goal (going backwards)
  Have an idea
  Try to achieve it
  Learn what's on the critical path
Both are valuable. The second is harder, but gives you more.

Goal-driven Solr
1. Start with the simplest configuration that works
2. Get something in (import data)
3. Get something out (display data)
4. Celebrate!
5. Decide/Fine-tune what/how you want to find things
6. Change the schema to match
7. Change the import/display to match
8. GOTO 5 (never really stops)

Getting data in
curl
post.jar (in example/exampledocs); try "java -jar post.jar -h" for help
Admin UI (core/Documents)
Clients (SolrJ, among 33 at various levels of support: https://leanpub.com/solr-clients/)
Formats: XML, JSON, CSV, other formats (processed with Tika)
DataImportHandler to pull data from external sources
BigData connectors (Hadoop, Flume, etc.)
BigData integrations (DataStax for Solr on Cassandra, Cloudera for Solr on HDFS)

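As a rough sketch of the "curl" route in script form, the snippet below builds a JSON update request using only the Python standard library. It assumes a local Solr with a core named collection1 and an /update handler as in the simplest solrconfig.xml above; the document fields other than id are hypothetical. Building the request is kept separate from sending it, so the payload can be inspected without a running server.

```python
import json
from urllib.request import Request, urlopen

# Assumed location of a local Solr core; adjust the host and core name to your setup.
UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def build_update_request(docs, url=UPDATE_URL):
    """Build (but do not send) a POST request carrying documents as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to Solr and return the HTTP status code (needs a running Solr)."""
    with urlopen(request) as response:
        return response.status


# Hypothetical documents; with the catch-all dynamicField every extra field is a string.
docs = [
    {"id": "book-1", "title": "Solr in Action", "cat": ["book", "search"]},
    {"id": "book-2", "title": "Enterprise Search", "cat": ["book"]},
]
request = build_update_request(docs)
```

Calling send(request) would index both documents and commit them. Committing on every request is convenient for a demo but usually too expensive for bulk loads.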
Getting data out
curl
Web browser
Admin UI (core/Query)
Clients (ResponseWriters for JSON, XML, Python, Ruby, PHP, CSV)
UI toolkits (Cloudera HUE, TwigKit)
Internal post-processors (we saw VelocityResponseWriter at /browse)
Needs middleware or a strong proxy - not secure otherwise

Celebrate!
You achieved a basic end-to-end test
You got Solr running
You figured out how to display it
You now know where the issues are
FIX THOSE NEXT

Fine-tune schema
Solr is not friends with your data, it's here to get your documents found.
<field name="features" stored="true" indexed="true" type="text_general" multiValued="true"/>
stored=true - that's for you
indexed=true - that's for Solr, where the magic happens
type="type_name" - defines what analyser chain to use
See Admin UI core/Analysis
See http://www.solr-start.com/info/analyzers/ for full list

Analyzers - English
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>….
  </analyzer>….

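To get a feel for what such a chain does to text, here is a toy Python imitation of the English analyzer: a tokenizer followed by filters applied in the same order as in the fieldType. None of this is Solr code. The stop-word and protected-word sets are tiny stand-ins for stopwords_en.txt and protwords.txt, and naive_stem is a deliberately crude suffix stripper, not the Porter algorithm.

```python
import re

STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # stand-in for lang/stopwords_en.txt
PROTECTED = {"news"}                               # stand-in for protwords.txt


def tokenize(text):
    """Split on anything that is not a letter, digit, or apostrophe."""
    return re.findall(r"[A-Za-z0-9']+", text)


def stop_filter(tokens):
    """Drop stop words, ignoring case (like ignoreCase="true")."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def lower_case(tokens):
    return [t.lower() for t in tokens]


def strip_possessive(tokens):
    """Remove a trailing 's from each token."""
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def naive_stem(tokens):
    """Crude suffix stripping that leaves protected words untouched."""
    stemmed = []
    for token in tokens:
        if token in PROTECTED:
            stemmed.append(token)
        elif token.endswith("ing") and len(token) > 5:
            stemmed.append(token[:-3])
        elif token.endswith("s") and not token.endswith("ss") and len(token) > 3:
            stemmed.append(token[:-1])
        else:
            stemmed.append(token)
    return stemmed


def analyze(text):
    """Run the tokenizer, then each filter in order, as an analyzer chain does."""
    tokens = tokenize(text)
    for step in (stop_filter, lower_case, strip_possessive, naive_stem):
        tokens = step(tokens)
    return tokens
```

For example, analyze("The Engine's searching of Documents and News") yields engine, search, document, news. The order of the steps matters: swapping two filters can change which tokens end up in the index, which is why the Admin UI Analysis screen is worth using on real data.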
Analyzers - Persian
<fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PersianCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.PersianNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt" />
  </analyzer>
</fieldType>

copyField FTW
<copyField source="cat" dest="text"/>
<copyField source="*_t" dest="text" maxChars="3000"/>
Indexing book authors: "Schildt, Herbert; Wolpert, Lewis; Davies, P."
For searching: tokenized, case-folded, punctuation-stripped:
  schildt / herbert / wolpert / lewis / davies / p
For sorting: untokenized, case-folded, punctuation-stripped:
  schildt herbert wolpert lewis davies p
For faceting: primary author only, using a solr.StrField:
  Schildt, Herbert

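The three author representations can be mimicked in a few lines of Python. In Solr the work would be done by copyField plus a different field type per destination (and, for the primary author, some pre-processing); the function names below are hypothetical and only show what each destination field ends up holding.

```python
import re


def search_tokens(authors):
    """For searching: tokenized, case-folded, punctuation-stripped."""
    return re.findall(r"\w+", authors.lower())


def sort_key(authors):
    """For sorting: one untokenized, case-folded, punctuation-stripped value."""
    return " ".join(search_tokens(authors))


def facet_value(authors):
    """For faceting: the primary (first) author, kept exactly as written."""
    return authors.split(";")[0].strip()


authors = "Schildt, Herbert; Wolpert, Lewis; Davies, P."
```

The same source value feeds all three fields, and each use case gets the form that suits it. That is the point of the slide: one input, several purpose-built copies.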
Fine-tune search
Default query parser supports Lucene search syntax:
  text +compulsory -negated field:value
  uses default field or explicit field
  not very good for complex analysis
eDisMax supports that plus searching across many fields
Many more specialised types: https://cwiki.apache.org/confluence/display/solr/Other+Parsers

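A small sketch of how the two parsers differ at the request level, using only the standard library to build /select URLs. The q, df, defType, qf, wt, and rows parameters are standard Solr request parameters; the host, the collection1 core name, and the field names (name, features, text) are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed local Solr core with a /select handler.
SELECT_URL = "http://localhost:8983/solr/collection1/select"


def lucene_query_url(q, default_field=None, rows=10):
    """Default (Lucene) parser: one default field unless the query names a field."""
    params = {"q": q, "wt": "json", "rows": rows}
    if default_field:
        params["df"] = default_field
    return SELECT_URL + "?" + urlencode(params)


def edismax_query_url(q, fields, rows=10):
    """eDisMax parser: the same user query searched across several (boosted) fields."""
    params = {
        "q": q,
        "defType": "edismax",
        "qf": " ".join(fields),
        "wt": "json",
        "rows": rows,
    }
    return SELECT_URL + "?" + urlencode(params)


lucene_url = lucene_query_url("wifi +router -used cat:electronics", default_field="text")
edismax_url = edismax_query_url("wifi router", ["name^2", "features"])
```

Pasting either URL into a browser is enough to try it out. With eDisMax, the list of fields and boosts lives in the request (or in handler defaults) rather than in the query text the user types.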
Fine-tune indexing
UpdateRequestProcessor
  after you send your data to Solr
  before it hits the schema
Deal with missing values, do pre-processing, identify languages, secret to schemaless mode (see example-schemaless)
Defined in solrconfig.xml, search for updateRequestProcessorChain
Full list at: http://www.solr-start.com/info/update-request-processors/

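Conceptually, an update processor chain is a list of steps that each get a look at the incoming document before it is indexed. The Python below is only a model of that idea, not Solr's Java API: every function name is hypothetical, and the language "detection" is a toy ASCII check standing in for a real language identifier.

```python
def trim_strings(doc):
    """Pre-processing step: strip surrounding whitespace from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}


def default_value(field, value):
    """Build a step that fills in a missing field with a default value."""
    def processor(doc):
        doc.setdefault(field, value)
        return doc
    return processor


def guess_language(doc):
    """Toy language identification: plain-ASCII text is labelled 'en', anything else 'other'."""
    text = doc.get("text", "")
    doc["language"] = "en" if text.isascii() else "other"
    return doc


def run_chain(chain, doc):
    """Pass a copy of the document through each processor in order."""
    doc = dict(doc)
    for processor in chain:
        doc = processor(doc)
    return doc


chain = [trim_strings, default_value("cat", "unknown"), guess_language]
incoming = {"id": "1", "text": "  hello world  "}
processed = run_chain(chain, incoming)
```

Here processed gains a cat default and a language field while incoming is left untouched. In Solr the equivalent chain is declared in solrconfig.xml, so this kind of clean-up stays out of the client code.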
Fine-tune display
Sorting
Faceting - automatic taxonomy with counts (indexed value)
Highlighting
MoreLikeThis
Statistics
Grouping, Pivoting
Debug for troubleshooting

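Faceting boils down to counting how many matching documents carry each value of a field. A minimal sketch of that counting in Python follows; it assumes documents are plain dicts and uses a hypothetical cat field. Solr computes these counts from the index rather than by looping over documents.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count documents per field value, highest count first, ties ordered by value."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]
        # A document counts once per distinct value, even if the value is repeated.
        counts.update(set(values))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical search results with a multi-valued "cat" field
results = [
    {"id": "1", "cat": ["electronics", "memory"]},
    {"id": "2", "cat": ["electronics", "hard drive"]},
    {"id": "3", "cat": "search"},
]
facets = facet_counts(results, "cat")
```

The counts are per value exactly as stored, which is why the slide stresses the indexed value: a tokenized or stemmed field would facet on the processed tokens, not on the original text.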
Documentation
Solr Wiki - old but still has a lot of information
Solr Reference Guide - new; online and downloadable
http://www.solr-start.com/ - my resources for learners
http://heliosearch.org/author/joel-bernstein/ - about new features

With Solr, how far can I go?
Cloudera (BigData) has > 1,000,000,000 USD in investments - opportunities?
8M+ searches/day, 40 languages, 100ms NRT, 1024 cores, 256 shards, 32 servers on #solr at Bloomberg http://bit.ly/1jmG72G (via @FlaxSearch)

Hackathon

First steps
Install Solr 4.9
Go through the tutorial - gives you basics and end-to-end test
Join the Slack chat (invitations are coming)
Tweet #SolrMasterclassBkk, @SolrStart, if you have space :-)
Attend breakout sessions
Choose your own adventure (next)

Path 1 - Solr indexing book
Great for first timers
Gets you from zero to comfortable
All examples are provided
If you are stuck, I will help you
Probably will not win you any prizes...
Do it for the skills

Path 2 - Your own dataset
Get it in at any cost
Get it displayed
Start iterating
Book a time slot to discuss your questions
Demo tips
  Explain problem domain (what is your dataset)
  Show how far you got
  Discuss the challenges

Path 3 - Need a dataset
Index your favourite Git repository (e.g. Solr): https://github.com/arafalov/git-to-solr
Your own WordPress blog export (with DataImportHandler)
Your own hard drive
Demo tips
  How far did you get
  Concentrate on displaying something cool (statistics?)
  Coolest Solr feature you found

Path 4 - A bigger challenge
Project Gutenberg (ask me for a copy of RDF dump)
WorldCup matches data: http://worldcup.sfg.io/
Twitter feed (e.g. with Spring XD/Integration)
Your own photograph collection (Tika extracts metadata)

DEMO Rules
There are no rules
And the prizes are not terribly important
What we are looking for is learning
Make something new out of something old
Learn a new feature and show others
Learn, teach, share - everybody wins

For later

Accelerate your learning
If you still feel like a beginner, buy my book - seriously. That's what it's for
All code/data is at: https://github.com/arafalov/solr-indexing-book
Buy Solr in Action - recently published and a great reference; follow @ManningBooks for discounts
Use my www.solr-start.com resources and join the mailing list (I'll do that for you this time)
Join solr-user mailing list - full of advanced hackers
Watch Lucid Revolution videos for background
Start helping out on Stack Overflow #solr
Blog what you learned, tweet with #Solr

Other Search-related books
Designing the Search Experience: The Information Architecture of Discovery - by a TwigKit creator +1
Search Analytics for Your Site: Conversations with Your Customers by Louis Rosenfeld - see also Quepid
Enterprise Search by Martin White

Alexandre Rafalovitch
www.outerthoughts.com

Solr Masterclass Bangkok, June 2014

  • 1. Apache Solr Masterclass From zero to hero June 2014 www.slideshare.net/arafalov/solr-masterclass-bangkok-june-2014
  • 3. Web search engines ! are quite sophisticated 3
  • 4. 4
  • 5. But the real search needs ! are! much DEEPER and BROADER 5
  • 7. Searching people and companies 7
  • 11. Understanding full-text search SELECT * 
 FROM database
 WHERE field LIKE ‘%word%’# This DOES NOT Scale# Instead: # break text into tokens# domain-specific processing (e.g. lower-casing)# build fast-access structures# algorithms for term, phrases, proximity search 11
  • 12. Basic search engine features Search (Duh!): keyword, phrase, field-specific# Positive and negative terms# Sort: relevancy, recency# Pagination# Compact summary in results# SPEED 12
  • 13. Advanced search engine features Facets/Taxonomy - based navigation with live counts# Language-specific processing# Domain-specific text processing (WiFi = Wi-Fi = WIFI)# Geographic search# More-like-this, did-you-mean, autocomplete# Scaling/Clustering# NOT web crawling - different, but related 13
  • 14. Search engine solutions? Solr# Elastic Search# Xapian# Sphinx# Groonga# Searchdaimon# {F}lexSearch# Algolia (SaaS)# Searchify (SaaS)# ForageJS# Lunr.js# FACT-Finder# DtSearch# MarkLogic# Verity# Fast# Most databases# ! ! …AND MORE 14
  • 15. Used with permission from SemaText Open Source Search Evolution 15
  • 16. Secret Ingredient - Lucene Solr# Elastic Search# SwiftType# Galene (LinkedIn’s)# PyLucene (Python wrapper)# Lucene.net (C# port) Scalable, high-performance indexing# Incremental indexing# Full-text search# Information-Retrieval algorithms# Implemented in Java# Written in 1999, still going strong 16
  • 17. Secret Ingredient - Solr Certified distributions# LucidWorks# HelioSearch# Big Data platforms# Cloudera# Hortonworks HDP# Hosted and SaaS# Amazon CloudSearch# WebSolr, SolrHQ, SearchBox Lucene full-text-search# XML and REST config# Schema/Schemaless# SolrCloud (clustering)# Caching# Near real-time# Rich-document indexing (Tika inside)# Plugins, components, processors 17
  • 18. Solr Ecosystem sample Drupal# Project Blacklight# LuxDB# SolrMeter# CrafterCMS# Typo3# Magenta# HippoCMS# ColdFusion# SolrNet# DataStax# Dovecot# NGData Lily# Basho Riak# YaCy# Apache ManifoldCF# Apache Camel# FranzAllegrograph# BitNami Solr Stack# Carrot2! Broadleaf Commerce# Cloudera CDK! CodeLibs Fess (フェス)! Splunk# Alfresco# Rosette by BasisTech! Luwak by Flax! Quepid by OSC! TwigKit! SPM by SemaText! SILK by LucidWorks! Banana (O/S Solr Kibana) 18
  • 20. DEMO - Basic Unzip# Go to example directory# Run Solr# Import some documents from example docs# grep -l store *.xml | xargs ./post.sh# Show off Solr 4 admin panel 20
  • 21. DEMO - Browse handler Restart Solr with -Dsolr.clustering.enabled=true# Visit http://localhost:8983/solr/browse/ # Show off# Search# Facets - Categories and Ranges# Spatial/Geo-distance# Clusters 21
  • 23. Start for free Download, unzip, cd example; java -jar start.jar# Go through basic tutorial in docs/tutorial.html# Copy example directory, modify schema.xml until happy# If coming from ElasticSearch, look at example-schemaless# Do NOT follow this path to production# Example schema is a kitchen sink !!! Read it as a story.# <solr>/examples/solr/collection1/conf/{schema.xml|solrconfig.xml} 23
  • 24. Simplest Solr - directory layout solr-home - point here with -Dsolr.solr.home collection1 - default collection name, without solr.xml conf - configuration directory for the collection schema.xml - defines fields and types solrconfig.xml - defines low-level configuration but also components, handlers, and chains for UpdateRequestProcessor 24
  • 25. Simplest Solr - schema.xml <?xml version="1.0" encoding="UTF-8" ?> <schema version="1.5" name="simplest-solr"> <fieldType name="string" class=“solr.StrField"/> ! <field name="id" type="string" indexed="true" stored="true" required="true"/> <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/> ! <uniqueKey>id</uniqueKey> </schema> 25
  • 26. Simplest Solr - solrconfig.xml <?xml version="1.0" encoding="UTF-8" ?> <config> <luceneMatchVersion>LUCENE_4_9</luceneMatchVersion> <requestDispatcher handleSelect="false"> <httpCaching never304="true" /> </requestDispatcher> <requestHandler name="/select" class="solr.SearchHandler" /> <requestHandler name="/update" class="solr.UpdateRequestHandler" /> <requestHandler name="/admin" class="solr.admin.AdminHandlers" /> <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" /> </config> 26
  • 27. DEMO https://github.com/arafalov/simplest-solr-config java -Dsolr.solr.home=…./simplest-solr Go to <solr>/example/exampledocs grep -l store *.xml |xargs ./post.sh (same, same) Check Admin UI Query - same, but different (multivalue, date) Schema browser 27
  • 28. Lots of things missing Some admin UI items disabled (Ping, Files)# No Near-Real-Time or atomic/partial update# No types (apart from String)# No dynamic schema# No SolrCloud# DOES NOT MATTER. NOTYET! 28
  • 29. Two ways of learning You can follow a path (going forward)# A tutorial# A book# Learn what it teaches# You can reach for the goal (going backwards)# Have an idea# Try to achieve it# Learn what’s on the critical path# Both are valuable. The second is harder, but gives you more. 29
  • 30. Goal-driven Solr 1. Start with the simplest configuration that works# 2. Get something in (import data)# 3. Get something out (display data)# 4. Celebrate!! 5. Decide/Fine-tune what/how you want to find things# 6. Change the schema to match# 7. Change the import/display to match# 8. GOTO 5 (never really stops) 30
  • 31. Getting data in curl# post.jar (in example/exampledocs); Try “java -jar post.jar -h” for help# Admin UI (core/Documents)# Clients (SolrJ, among 33 at various level of support: https://leanpub.com/solr- clients/)# Formats: XML, JSON, CSV, other formats (processed with Tika)# DataImportHandler to pull data from external sources# BigData connectors (Hadoop, Flume, etc) # BigData integrations (DataStax for Solr on Cassandra, Cloudera for Solr on HDFS) 31
  • 32. Getting data out Curl# Web browser# Admin UI (core/Query)# Clients (ResponseWriters for JSON, XML, Python, Ruby, PHP, CSV)# UI toolkits (Cloudera HUE, TwigKit)# Internal post-processors (we saw VelocityResponseWriter at /browse)# Needs middleware or strong proxy - not secure otherwise 32
  • 33. Celebrate! You achieved basic end-to-end test# You got Solr running# You figured out how to display it# You now know where the issues are# FIX THOSE NEXT 33
  • 34. Fine-tune schema Solr is not friends with your data, it’s here to get your documents found.# <field name="features" stored="true" indexed="true" type="text_general" multiValued=“true"/># stored=true - that’s for you# indexed=true - that’s for Solr, where the magic happens# type=“type_name” - defines what analyser chain to use! SeeAdminUI core/Analysis# See http://www.solr-start.com/info/analyzers/ for full list 34
  • 35. Analyzers - English <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"># <analyzer type="index"># <tokenizer class="solr.StandardTokenizerFactory"/># <filter class=“solr.StopFilterFactory" ignoreCase=“true" words=“lang/ stopwords_en.txt"/># <filter class="solr.LowerCaseFilterFactory"/># # <filter class="solr.EnglishPossessiveFilterFactory"/># <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/># <filter class=“solr.PorterStemFilterFactory”/>….# </analyzer>…. 35
  • 36. Analyzers - Persian
 <fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <charFilter class="solr.PersianCharFilterFactory"/>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ArabicNormalizationFilterFactory"/>
     <filter class="solr.PersianNormalizationFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt"/>
   </analyzer>
 </fieldType>
  • 37. copyField FTW
 <copyField source="cat" dest="text"/>
 <copyField source="*_t" dest="text" maxChars="3000"/>
 Indexing book authors: "Schildt, Herbert; Wolpert, Lewis; Davies, P."
 For searching: tokenized, case-folded, punctuation-stripped: schildt / herbert / wolpert / lewis / davies / p
 For sorting: untokenized, case-folded, punctuation-stripped: schildt herbert wolpert lewis davies p
 For faceting: primary author only, using a solr.StrField: Schildt, Herbert
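The three shapes of the author value above can be sketched in a few lines of Python. In Solr the same effect would come from copying the source field into differently typed fields; the function names here are made up for illustration and only mimic the resulting values.

```python
import re


def search_tokens(authors):
    """Tokenized, case-folded, punctuation-stripped: one token per word."""
    return re.findall(r"\w+", authors.lower())


def sort_key(authors):
    """Untokenized, case-folded, punctuation-stripped: one single string."""
    return " ".join(search_tokens(authors))


def primary_author(authors):
    """Facet value: the first author only, kept exactly as written."""
    return authors.split(";")[0].strip()


authors = "Schildt, Herbert; Wolpert, Lewis; Davies, P."

for_search = search_tokens(authors)   # ['schildt', 'herbert', 'wolpert', 'lewis', 'davies', 'p']
for_sort = sort_key(authors)          # 'schildt herbert wolpert lewis davies p'
for_facet = primary_author(authors)   # 'Schildt, Herbert'
```

One source value, three representations: that is the reason to copy a field rather than try to make a single field type serve search, sort, and facet at once.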
  • 38. Fine-tune search: the default query parser supports Lucene search syntax: text +compulsory -negated field:value; it uses the default field or an explicit field; it is not very good for complex analysis; eDisMax supports that plus searching across many fields; many more specialised types: https://cwiki.apache.org/confluence/display/solr/Other+Parsers
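As a hedged sketch of "searching across many fields" with eDisMax, the Python below builds the query-string parameters: defType=edismax selects the parser and qf lists the fields to search with optional boosts. The field names and boost values are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode


def build_edismax_params(user_query, field_boosts):
    """Encode an eDisMax request that searches several fields at once."""
    # qf is a space-separated list of field^boost pairs.
    qf = " ".join(f"{name}^{boost}" for name, boost in field_boosts.items())
    return urlencode(
        {"q": user_query, "defType": "edismax", "qf": qf, "wt": "json"}
    )


# Hypothetical fields: matches in "name" count twice as much as in "features".
params = build_edismax_params("wifi -cable", {"name": 2, "features": 1})

# Decode again to see what Solr would receive.
decoded = parse_qs(params)
```

The user's text still uses the familiar +compulsory and -negated syntax; only the list of fields and their weights moves into qf.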
  • 39. Fine-tune indexing: an UpdateRequestProcessor runs after you send your data to Solr and before it hits the schema; it can deal with missing values, do pre-processing, identify languages, and is the secret to schemaless mode (see example-schemaless); defined in solrconfig.xml, search for updateRequestProcessorChain; full list at: http://www.solr-start.com/info/update-request-processors/
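Conceptually, an update request processor chain is a pipeline of small steps that each get a chance to change the incoming document. The Python below is only a mental model of that idea: real processors are Java classes wired up in solrconfig.xml, and every function name, field name, and the crude language guess here are invented for illustration.

```python
def trim_strings(doc):
    """Pre-processing: strip surrounding whitespace from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}


def default_category(doc):
    """Missing values: supply a default when "cat" was not sent."""
    doc.setdefault("cat", ["uncategorised"])
    return doc


def guess_language(doc):
    """Crude stand-in for language identification (ASCII-only means "en")."""
    doc["language"] = "en" if doc.get("name", "").isascii() else "unknown"
    return doc


# Order matters, just as it does in a configured chain.
CHAIN = [trim_strings, default_category, guess_language]


def run_chain(doc, chain=CHAIN):
    """Pass a copy of the document through each processor in turn."""
    doc = dict(doc)
    for processor in chain:
        doc = processor(doc)
    return doc


processed = run_chain({"id": "1", "name": "  Solr Masterclass  "})
```

Only after the last step would the document be matched against the schema, which is why this is the place to repair data the schema would otherwise reject.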
  • 40. Fine-tune display: sorting; faceting - automatic taxonomy with counts (indexed value); highlighting; MoreLikeThis; statistics; grouping, pivoting; debug for troubleshooting
  • 41. Documentation: Solr WIKI - old but still has a lot of information; Solr Reference Guide - new, online and downloadable; http://www.solr-start.com/ - my resources for learners; http://heliosearch.org/author/joel-bernstein/ - about new features
  • 42. With Solr, how far can I go? Cloudera (BigData) has more than USD 1,000,000,000 in investments - opportunities? 8M+ searches/day, 40 languages, 100ms NRT, 1024 cores, 256 shards, 32 servers on #solr at Bloomberg http://bit.ly/1jmG72G (via @FlaxSearch)
  • 44. First steps: install Solr 4.9; go through the tutorial - it gives you the basics and an end-to-end test; join the Slack chat (invitations are coming); tweet #SolrMasterclassBkk, @SolrStart, if you have space :-); attend breakout sessions; choose your own adventure (next)
  • 45. Path 1 - Solr indexing book: great for first timers; gets you from zero to comfortable; all examples are provided; if you are stuck, I will help you; probably will not win you any prizes...; do it for the skills
  • 46. Path 2 - Your own dataset: get it in at any cost; get it displayed; start iterating; book a time slot to discuss your questions; demo tips: explain the problem domain (what is your dataset), show how far you got, discuss the challenges
  • 47. Path 3 - Need a dataset: index your favourite Git repository (e.g. Solr; see https://github.com/arafalov/git-to-solr); your own WordPress blog export (with DataImportHandler); your own hard drive; demo tips: how far did you get, concentrate on displaying something cool (statistics?), coolest Solr feature you found
  • 48. Path 4 - A bigger challenge: Project Gutenberg (ask me for a copy of the RDF dump); World Cup matches data: http://worldcup.sfg.io/ ; Twitter feed (e.g. with Spring XD/Integration); your own photograph collection (Tika extracts metadata)
  • 49. DEMO Rules: there are no rules; and the prizes are not terribly important; what we are looking for is learning; make something new out of something old; learn a new feature and show others; learn, teach, share - everybody wins
  • 51. Accelerate your learning: if you still feel like a beginner, buy my book - seriously, that’s what it’s for; all code/data is at: https://github.com/arafalov/solr-indexing-book ; buy Solr in Action - it came out recently and is a great reference, follow @ManningBooks for discounts; use my www.solr-start.com resources and join the mailing list (I’ll do that for you this time); join the solr-user mailing list - full of advanced hackers; watch Lucid Revolution videos for background; start helping out on Stack Overflow #solr; blog what you learned, tweet with #Solr
  • 52. Other Search-related books: Designing the Search Experience: The Information Architecture of Discovery - by a TwigKit creator +1; Search Analytics for Your Site: Conversations with Your Customers by Louis Rosenfeld - see also Quepid; Enterprise Search by Martin White