Presentation held at the Oslo Enterprise Search MeetUp in May 2010, pitched at an audience coming from the FAST ESP side with some existing FAST knowledge. Check out one of my other presentations if you are more familiar with Lucene/Solr.
Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
1. cominvent as
Enterprise Search Specialists
Migrating FAST to Solr
By Jan Høydahl
Oslo Enterprise Search MeetUp May 2010
2. Jan Høydahl
● IT architect - search, telecom, mobile
● Helped build FAST's Global Services as first engineer
● Founder of Cominvent AS
● Search consultant for 10 years
4. Consulting
– Cominvent delivers independent search consulting
– Focus on Apache Lucene/Solr & Microsoft FAST ESP
– Idea -> architecture -> implementation
5. Commercial Support (Solr/Lucene)
– When community & mailing list support is not enough
– Paid support agreement for Apache Solr/Lucene
– In cooperation with Lucid Imagination
– Read more: http://www.cominvent.com/support/
6. Training
– Cominvent AS delivers public and on-site training
– Certified Solr Training Partner for Lucid Imagination
– Certified FAST ESP Training Partner
– Read more: http://www.cominvent.com/training/
14. Apache Solr - characteristics
Search server
(Commercially friendly)
15. Apache Solr - characteristics
Modular Community
Contributions & patches
Light weight
16. Solr-user community growth
[Chart: solr-user messages per month, January 2006 to February 2010; y-axis from 0 to 1600 messages]
17. Lucene/Solr deployments
– More: http://wiki.apache.org/solr/PublicServers
Thanks to Lucid Imagination for logo collection
23. FAST ESP – characteristics & key strengths
Security
Connectors
25. FAST ESP – characteristics & key strengths
– Very strong document processing framework
[Diagram: document processing pipeline with the stages Format Conversion, Language Detection, Linguistic Normalization, Entities, Taxonomy, Sentiment, Ontology and Custom Plug-in, feeding Search and Alert. Sample input document:]
PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes.
The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros.
"I love being here, I love the French Open and more than anything I'd love to do well here," the American said.
A first round loser last year, Williams is hoping to progress beyond the quarter-finals for the first time in her career.
28. Migration objectives
– Possible objectives include:
• Lower maintenance cost
• Deeper in-house competency
• Less dependent on external consultants
• Ownership and visibility of source code
• Shorter time to market for new features
• Bugs fixed faster – or even fix them ourselves
• Larger community, mailing lists that work!
• More choice in external consultants
• Contribute back to Open Source
• Lower HW footprint
29. Migration steps
– Knowledge gathering & Training
– Review current features & architecture
• Want to keep all features? Add new?
– Migration areas:
• Index profile
• Content
• Feeding
• Document Processing
• Querying
• Search middleware?
• Admin & Operational
– What to do in Application space vs Search space?
30. Feature comparison ESP – Solr (similarities)
Feature | ESP | Solr
Full-text, boolean, range search, sorting, sub-second, facets, did-you-mean, synonyms | Yes | Yes
Scaling for QPS | Add rows | Add rows
Scaling for document volume | Add columns | Add shards
Synonyms | Index/query side | Index/query side
GEO search | Yes | Yes (1.5)
Boolean query language | Yes (FQL) | Yes (Lucene or (e)DisMax)
APIs | HTTP, Java, .NET, C++, PHP | HTTP, Java, .NET, Ruby, Python, PHP, Perl, JS
31. Feature comparison ESP – Solr (differences)
Feature | ESP | Solr
Admin server | Yes | No (coming 1.5)
Processes | Many (C++, Java, Python) | One WAR in Java app-server, 100% Java
Navigators / Facets | Index-time | Query-time
Did-you-mean | Dictionary based | Dictionary or index based
Feeding | API only | HTTP POST or API
Document processing | Pipeline (py) | Simple pipeline (Java, JS, Groovy, Jython, JRuby..)
Multi field querying | Composite fields | DisMax handler
32. Feature comparison ESP – Solr (differences)
Feature | ESP | Solr
Relevancy tuning | Rank profiles, term boosting | Dynamic function queries and boost functions
XRANK | XRANK operator | Function Queries
Freshness boost | Freshness in rank profile | Function Queries
Boost GEO distance | Rank profile and special | Function Queries
Major schema or software updates | Cold update, use stage environment | Stage new content into new Solr core
Pluggability | Docprocs, QT/RP (limited), clients | Everything :) Request Handlers, Query Parsers, Docprocs, Rank, Spell, tokenizer++
33. Feature comparison ESP – Solr (differences)
Feature | ESP | Solr
Lemmatization | Can be licensed for many languages | Can be licensed from 3rd party
Query syntax (boolean) | and(a:foo, b:bar) | a:foo AND b:bar
Query syntax (numeric range) | i:range(0, 100) | i:[0 TO 100]
Query syntax (date range) | d:range(2000-01-01T00:00:00, 2010-03-03T12:00:00) | d:[2000-01-01T00:00:00Z TO NOW]
Query params | query= | q=
Query params | offset= | start=
Query params | hits= | rows=
Query params | spell=1 | spellcheck=true
What fields to return | view=viewname | fl=title,price,body...
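The query parameter rows in the table above can be applied mechanically when an existing front end still emits ESP-style requests. The following is a minimal sketch under that assumption. The helper name esp_params_to_solr is hypothetical, and it only renames parameters. It does not translate FQL query syntax into Lucene syntax.

```ruby
require 'uri'

# Parameter names taken from the comparison table (ESP => Solr).
ESP_TO_SOLR = {
  'query'  => 'q',
  'offset' => 'start',
  'hits'   => 'rows'
}.freeze

# Hypothetical helper: rename ESP-style request parameters to their Solr
# equivalents and return a URL-encoded query string.
def esp_params_to_solr(esp_params)
  solr = {}
  esp_params.each do |name, value|
    key = name.to_s
    if key == 'spell'
      # ESP uses spell=1, Solr uses spellcheck=true
      solr['spellcheck'] = (value.to_s == '1').to_s
    elsif ESP_TO_SOLR.key?(key)
      solr[ESP_TO_SOLR[key]] = value
    else
      solr[key] = value # pass unknown parameters through unchanged
    end
  end
  URI.encode_www_form(solr)
end

puts esp_params_to_solr('query' => 'oslo', 'offset' => 10, 'hits' => 20, 'spell' => 1)
```

A thin translation layer like this can keep an old client working during migration, but the query string itself still has to be rewritten separately.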
34. Feature comparison ESP – Solr (differences)
Feature | ESP | Solr
Search XML hierarchy | Yes, scope search | No
Reports | Built in analytics | Use 3rd party log analysis such as Splunk.com
35. Your existing FAST system - overview
Your web-app
Search middleware?
Graphics diagram: www.microsoft.com
36. Migrating index profile
– ESP index profile -> Solr schema.xml
– Set up field types: use the defaults or create your own
– Set up the static fields: each field in the ESP index profile maps to a field declaration in Solr's schema.xml
– No need for generic* fields; use Solr dynamic fields instead
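As a rough illustration of the mapping from index profile fields to schema.xml, the sketch below renders Solr field declarations from a small field list. The field list is hypothetical, the helper schema_line is made up for this example, and the type names string and text assume the types defined in Solr's example schema.

```ruby
require 'cgi'

# Render one Solr schema.xml declaration; tag is "field" or "dynamicField".
def schema_line(tag, name, type, indexed: true, stored: true)
  attrs = { 'name' => name, 'type' => type, 'indexed' => indexed, 'stored' => stored }
  rendered = attrs.map { |k, v| %(#{k}="#{CGI.escapeHTML(v.to_s)}") }.join(' ')
  "<#{tag} #{rendered}/>"
end

# Hypothetical field list, as it might be read off an ESP index profile.
static_fields = [
  { name: 'id',    type: 'string' },
  { name: 'title', type: 'text' },
  { name: 'body',  type: 'text', stored: false }
]

lines = static_fields.map do |f|
  schema_line('field', f[:name], f[:type], stored: f.fetch(:stored, true))
end

# One dynamic field pattern takes the place of the ESP generic* fields:
# any field whose name ends in _s (e.g. color_s) is accepted without a schema edit.
lines << schema_line('dynamicField', '*_s', 'string')

puts lines
```

The output is meant to be pasted into the fields section of schema.xml and reviewed by hand, since analysis settings rarely map one to one.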
37. Migrating index profile
– Composite fields?
• Solr can use <copyField> to copy multiple fields into one, e.g. to map many attributes into a single field
• However, ranking with a different boost for each field does not need a composite field in Solr. Use the DisMax query handler instead. Very powerful!
– No need to edit the schema to add new fields. With dynamic fields it is easy to introduce, for example, a color facet for cars or a megapixels facet for digital cameras
38. DisMax query example
– This Solr query can replace the use of a composite field
• qt=dismax
• q=oslo
• qf=title^0.7 highpriorityfields^1.5
mediumpriorityfields^0.6 lowpriorityfields^0.2
recallfields^0.0 body^0.0
• bf=recip(rord(creationDate),1,1000,1000)
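The parameters above need URL encoding before they can be sent to Solr. A minimal sketch using only Ruby's standard library follows. The field names and boosts are the ones from the slide, while the helper name dismax_url and the host, port and path are assumptions for a local install.

```ruby
require 'uri'

# Build a DisMax select URL. The base URL is an assumption for a local install.
def dismax_url(user_query, base = 'http://localhost:8080/solr/select')
  params = {
    'qt' => 'dismax',
    'q'  => user_query,
    # qf: fields to search, each with its own boost (replaces a composite field)
    'qf' => 'title^0.7 highpriorityfields^1.5 mediumpriorityfields^0.6 ' \
            'lowpriorityfields^0.2 recallfields^0.0 body^0.0',
    # bf: boost function added to the score, favouring recently created documents
    'bf' => 'recip(rord(creationDate),1,1000,1000)'
  }
  "#{base}?#{URI.encode_www_form(params)}"
end

puts dismax_url('oslo')
```

Keeping qf and bf in one place like this makes it easy to tune boosts without touching the schema, or to move them into the request handler defaults later.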
39. Migrating content
– If using the FAST ContentAPI to push content programmatically
• Use Solr's clients (Java, .NET, Ruby, Python, PHP...)
– If feeding FastXML using FileTraverser
• Feed as Solr XML using HTTP POST or a POST client
– If you feed custom XML with XMLMapper
• Have a look at the import and mapping features of the DataImportHandler (DIH)
40. Push Feeding example
– Feed XML using HTTP POST:
• curl 'http://localhost:8080/solr/update?commit=true' \
    -H "Content-Type: text/xml" \
    --data-binary @mydoc.xml
– Ruby example:
• > gem sources -a http://gemcutter.org
  > sudo gem install rsolr

  require 'rsolr'
  # The URL includes the /solr context path, as in the curl example
  solr = RSolr.connect :url => 'http://localhost:8080/solr'
  documents = [{:id => 1, :price => 1.00},
               {:id => 2, :price => 10.50}]
  solr.add documents   # send the documents to the index
  solr.commit          # make them searchable
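The curl example posts a file in Solr's XML update format, which wraps each document in an add/doc/field structure. As a sketch of what such a file could contain, the helper below (solr_add_xml, a name made up for this example) turns the same array of hashes into an add message using only the standard library.

```ruby
require 'cgi'

# Turn an array of hashes into a Solr XML <add> message.
# Array values become repeated <field> elements (multi-valued fields).
def solr_add_xml(documents)
  docs = documents.map do |doc|
    fields = doc.flat_map do |name, value|
      Array(value).map do |v|
        %(<field name="#{CGI.escapeHTML(name.to_s)}">#{CGI.escapeHTML(v.to_s)}</field>)
      end
    end
    "<doc>#{fields.join}</doc>"
  end
  "<add>#{docs.join}</add>"
end

puts solr_add_xml([{ id: 1, price: 1.00 }, { id: 2, price: 10.50 }])
```

Escaping field values matters here, since content converted from FastXML often contains ampersands and angle brackets.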
42. Querying examples
– http://localhost:8080/solr/select?q=car&fl=id,title
– Ruby
• res = solr.select :q => 'roses', :fq => ['red', 'white']
  res['response']['docs'].each do |doc|
    puts doc['title']
  end
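Without a client library, the same kind of query can be issued as a plain URL and the result read as JSON via Solr's wt=json response format. The sketch below builds such a URL with one fq parameter per filter and extracts titles from a response body. The helper names, the filter fields color and inStock, and the sample response are made up for illustration.

```ruby
require 'json'
require 'uri'

# Build a select URL; each filter becomes its own fq parameter.
def select_url(query, filters: [], fields: %w[id title],
               base: 'http://localhost:8080/solr/select')
  params = { 'q' => query }
  params['fq'] = filters unless filters.empty?
  params['fl'] = fields.join(',')
  params['wt'] = 'json'
  "#{base}?#{URI.encode_www_form(params)}"
end

# Pull the titles out of a Solr JSON response body.
def titles(response_body)
  JSON.parse(response_body).fetch('response').fetch('docs').map { |doc| doc['title'] }
end

puts select_url('roses', filters: ['color:red', 'inStock:true'])

# Hand-written sample body, for illustration only (not real server output).
sample = '{"response":{"numFound":1,"start":0,"docs":[{"id":"1","title":"Red roses"}]}}'
puts titles(sample)
```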
43. Migrating document processing
– Solr lacks a sophisticated pipeline with entity extraction etc. Alternatives:
• Do extraction in Application space (Ruby)
• Write your own stage in the Solr pipeline for simple cases
• Integrate with an external pipeline for more advanced processing
– Matchers/extractors
• LingPipe NamedEntityExtractor inside of OpenPipeline
– Synonyms:
• Use Solr's synonym handling index/query side
– Custom stages:
• Write a Solr UpdateProcessor (in Java, Jython etc)
– Got a LOT of custom FAST docproc stages?
• Have a look at SESAT's PY ProcServer for Solr (GPL)
44. Migrating linguistics (lemmatization)
– Solr ships with stemming instead of lemmatization
– Stemming has limitations
• Biler, bilen, bilene -> bil
• BUT: bøker, bøkene -> bøk, while boka, bok -> bok (forms of the same word end up with different stems)
– KStem is better. Free with LucidWorks for Solr
– If you need singular/plural handling only
• Free dictionaries? Check lucene-hunspell
– Lemmatization can be licensed from a 3rd party such as Basistech, who also offer language identification & entity extraction
– Language identification is also available from Sematext
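The stemming limitation listed above can be reproduced with a toy suffix stripper. This is a deliberately naive sketch with a hand-picked suffix list, not the stemmer that Solr ships with, but it shows why a vowel change in the plural defeats suffix stripping.

```ruby
# Toy suffix-stripping stemmer for a few Norwegian noun endings.
# Purely illustrative: the suffix list is hand-picked for this example.
SUFFIXES = %w[ene en er a].freeze

def toy_stem(word)
  w = word.downcase
  # Strip the first matching suffix, but keep at least two characters of stem.
  suffix = SUFFIXES.find { |s| w.end_with?(s) && w.length > s.length + 1 }
  suffix ? w[0...-suffix.length] : w
end

%w[Biler bilen bilene Bøker bøkene boka bok].each do |word|
  puts "#{word} -> #{toy_stem(word)}"
end
```

All forms of "bil" collapse to one stem, while the forms of "bok" split into two stems, so a search for "bok" would not match "bøker". A dictionary-based lemmatizer avoids this by mapping each form to its base form.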
45. Basistech Rosette for Lucene
– High-end linguistics capabilities for 19 languages
– Language Identification
– Segmentation and tokenization
– Lemmatization
– Noun decompounding
– Part-of-speech tagging
– Entity extraction
– Easily integrated with Lucene/Solr
– More: http://www.basistech.com/lucene/
46. Migrating search middleware
– Using FAST Unity?
• Consider migrating middleware logic such as external
source querying and federation to SESAT (AGPL)
– Using Comperio Front?
• Ask Comperio for Solr engine support
• Or migrate custom Q&R formats
– Or is plain Solr enough?
• Solr has built-in support for shards
• A shard query will query multiple shards
and merge the results into one
• Add custom processing as Query
Components in Solr
• Check contrib & patches!
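In Solr a distributed query is requested with the shards parameter, and the merge happens inside Solr. The sketch below only illustrates the idea of that merge step under simplified assumptions: hits from every shard are collected, ordered by score, and the top rows are kept. The helper name and the sample hits are made up, and this is not Solr's implementation.

```ruby
# Conceptual sketch of merging per-shard hit lists into one result list.
# Each hit is a hash with an :id and a :score.
def merge_shard_results(shard_results, rows: 10)
  shard_results.flatten(1).sort_by { |hit| -hit[:score] }.first(rows)
end

# Made-up hits from two shards.
shard_a = [{ id: 'a1', score: 2.4 }, { id: 'a2', score: 0.9 }]
shard_b = [{ id: 'b1', score: 1.7 }]

merged = merge_shard_results([shard_a, shard_b], rows: 2)
puts merged.map { |hit| hit[:id] }.join(', ')
```

Custom logic that FAST middleware used to apply around this step is the kind of processing that would move into a Solr Query Component.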
47. Migrating Front ends
– Using a middleware with Solr support? Lucky you!
– If not, consider introducing one now. Look at (Java):
– If you decide to migrate from FAST Java/.NET APIs
• Choose SolrJ or SolrNET
• Query language differences. &fq= instead of filter()
• Unlike FAST's navigators, Solr facets do not require sessions/state
– Migrate FAST's «views» into named RequestHandler configs
– Multilingual: need to handle per-language fields such as title_no, title_en yourself :(
48. Migrating Web Crawler
– Solr has no built-in web crawler
• Instead you can choose from several integrations
– The Apache Nutch crawler
• Proven with hundreds of millions of pages
• http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
– Apache Droids
• Still in the Apache Incubator, but aims at becoming a full crawler
• http://incubator.apache.org/droids/
– Heritrix + Solr (example in the Solr 1.4 book)
– OpenPipeline has a (very) simple crawler
– Lucene Connectors Framework
• Preparing crawler support
49. Migrating Connectors
– Solr handles these sources internally through DIH:
• Database, RSS, Web-services, Local filesystem
– Additionally through the Lucene Connectors Framework (LCF):
• EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio, SharePoint, RSS
• New connectors should be written for LCF
– Another option:
• SharePoint, IMAP, Documentum, Vignette, Filesystem
50. Operations
– Solr has no admin-server (coming in 1.5)
– Possible to run multiple Tomcat instances on the same server
– Multiple cores in the same Tomcat – easier migration
– No built-in query reports, use 3rd party tools
– No built-in monitoring, have a look at
– Log analysis? Check out
52. Thank You
www.cominvent.com
jh@cominvent.com
www.twitter.com/cominvent
linkedin.com/in/janhoy
This presentation is licensed under a CC-by-sa license
You must attribute Cominvent with name and link