Solr 3.1 and Beyond
yonik@lucidimagination.com
October 8, 2010
Lucid Imagination
Yonik Seeley
Agenda
Goal: Introduce new features you can try & use now in Solr development versions 3.1 or 4.0
  Relevancy (Extended Dismax Parser)
  Spatial/Geo Search
  Search Result Grouping / Field Collapsing
  Faceting (Pivot, Range, Per-segment)
  Scalability (Solr Cloud)
  Odds & Ends
  Q&A
10/12/10 3
Solr 3.1? What happened to 1.5?
  Lucene/Solr merged (March 2010)
  Single set of committers
  Single dev mailing list (dev@lucene.apache.org)
  Single shared subversion trunk
  Keep separate downloads, user mailing lists
  Other former lucene subprojects spun off (Nutch, Tika, Mahout, etc)
  Development
  trunk is now always next major release (currently 4.0)
  branch_3x will be base for all 3.x releases
  Branch together, Release together, Share version numbers
RELEVANCE
Extended Dismax Parser
  Superset of dismax
&defType=edismax&q=foo&qf=body
  Fixes edge cases where dismax could still throw exceptions:
OR  AND  NOT  -  "
  Full lucene syntax support
  Tries lucene syntax first
  Falls back to smart escaping if there are syntax errors
  Optionally supports treating "and"/"or" as AND/OR in lucene syntax
  Fielded queries (e.g. myfield:foo) even in degraded mode
  uf parameter controls what field names may be directly specified in "q"
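A minimal sketch of how a client might assemble such a request. The endpoint URL and the helper name `edismax_url` are assumptions for illustration; only the `defType`, `q`, `qf` and `uf` parameters come from the slides.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint: the slides only show the query parameters.
SOLR_SELECT = "http://localhost:8983/solr/select"


def edismax_url(user_query, query_fields, allowed_user_fields=None):
    """Build a select URL that asks Solr to parse `q` with edismax."""
    params = {
        "defType": "edismax",          # switch the query parser, as on the slide
        "q": user_query,               # raw user input, may contain AND/OR/-/"
        "qf": " ".join(query_fields),  # fields searched for unfielded terms
    }
    if allowed_user_fields:
        # uf limits which field names users may write directly in q
        params["uf"] = " ".join(allowed_user_fields)
    return SOLR_SELECT + "?" + urlencode(params)


# An unbalanced quote is passed through untouched; the parser is expected
# to cope with it on the server side.
url = edismax_url('myfield:foo OR "bar', ["body"], ["myfield"])
```

The user input is URL-encoded but otherwise not escaped on the client, since the point of the parser is to tolerate such input.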
Extended Dismax Parser (continued)
  boost parameter for multiplicative boost-by-function
  Pure negative query clauses
Example: solr OR (-solr)
  Enhanced term proximity boosting
  pf2=myfield – results in term bigrams in sloppy phrase queries
	
myfield:"aa bb cc"  ->  myfield:"aa bb" myfield:"bb cc"
  
  Enhanced stopword handling
  stopwords omitted in main query, but added in optional proximity boosting part
Example: q=solr is awesome & qf=myfield & pf2=myfield
  ->  +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
  Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
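The bigram and stopword behaviour described above can be imitated in a few lines. This is only an illustration of the rewriting shown on the slide, not the parser's actual code; the function names are hypothetical and the stopword set is passed in by hand.

```python
def pf2_clauses(field, terms):
    """Turn consecutive term pairs into phrase clauses (term bigrams)."""
    return ['%s:"%s %s"' % (field, a, b) for a, b in zip(terms, terms[1:])]


def sketch_query(field, terms, stopwords):
    """Mimic the slide: stopwords dropped from the required part,
    but kept in the optional proximity-boosting part."""
    main_terms = [t for t in terms if t not in stopwords]
    required = "+%s:(%s)" % (field, " ".join(main_terms))
    boosts = " ".join(pf2_clauses(field, terms))
    return "%s (%s)" % (required, boosts)


bigrams = pf2_clauses("myfield", ["aa", "bb", "cc"])
rewritten = sketch_query("myfield", ["solr", "is", "awesome"], {"is"})
```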
SPATIAL SEARCH
Spatial Search
Step1: Index some locations!
<field name="name">The Alpine Shop</field>
<field name="store">44.013617,-73.168264</field>
Step2: Decide where you are
&pt=44.0153371,-73.16734
&d=1
&sfield=store
Step3: Profit!
Spatial Filter: &fq={!geofilt}
Bounding Box: &fq={!bbox}
Distance Function: &sort=geodist() asc
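To make the three steps concrete, here is a rough sketch that assembles the parameters and checks by hand that the store lies within `d` of the point. It assumes `d` is in kilometres and that a haversine great-circle distance is a fair stand-in for what `geodist()` computes; the `q=*:*` parameter and the Earth radius constant are also assumptions.

```python
import math
from urllib.parse import urlencode

EARTH_RADIUS_KM = 6371.0  # mean radius; assumed, Solr's constant may differ


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)   # indexed value of the "store" field
here = (44.0153371, -73.16734)    # the pt parameter
distance = haversine_km(here[0], here[1], store[0], store[1])

params = urlencode({
    "q": "*:*",                   # assumed: match everything, then filter
    "fq": "{!geofilt}",           # spatial filter from Step 3
    "sfield": "store",
    "pt": "%s,%s" % here,
    "d": 1,
    "sort": "geodist() asc",
})
```

Under these assumptions the store is roughly 0.2 km away, so a filter with `d=1` would be expected to keep it.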
RESULT GROUPING /
FIELD COLLAPSING
Field Collapsing Definition
 Field collapsing
  Limit the number of results per category
  “category” normally defined by unique values in a field
 Uses
  Web Search – collapse by web site
  Email threads – collapse by thread id
  Ecommerce/retail
  Show the top 5 items for each store category (music, movies, etc)
Field Collapsing by Site
Field Collapse on Product Type
Result Grouping by Category
Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
"manu_exact":{
"matches":3,
"groups":[{
"groupValue":"Belkin",
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"}]
}},
{
"groupValue":"Apple Computer Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"MA147LL/A",
Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]
&group.query=price:[100 TO *]&group.limit=5
"grouped":{
"price:[0 TO 99.99]":{
"matches":3,
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"},
{
"id":"F8V7067-APL-KIT",
"name":"Belkin Mobile Power Cord for iPod"}]
}},
"price:[100 TO *]":{
"matches":3,
"doclist":{"numFound":1,"start":0,"docs":[
{
Grouping Params
parameter                         meaning                                                            default
group.field=<field>               Like facet.field – group by unique field values
group.query=<query>               Like facet.query – top docs that also match
group.function=<function query>   Group by unique values produced by the function query
group.limit=<n>                   How many docs per group                                            1
group.sort=<sort spec>            How to sort documents within a group                               Same as "sort" param
rows=<n>                          How many groups to return                                          10
sort=<sort spec>                  How to sort the groups relative to each other (based on top doc)
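The interplay of `group.field`, `group.limit` and `rows` can be sketched client-side. This is an illustration of the semantics only, assuming the input list is already sorted by the main `sort`; the function name and the manufacturer value of the second Belkin document are assumptions.

```python
from collections import OrderedDict


def group_by_field(ranked_docs, field, group_limit=1, rows=10):
    """Group already-ranked docs by a field value.

    Groups appear in the order of their top document, each group keeps
    at most `group_limit` docs, and at most `rows` groups are returned.
    """
    groups = OrderedDict()
    for doc in ranked_docs:
        value = doc.get(field)
        if value not in groups:
            groups[value] = {"groupValue": value, "numFound": 0, "docs": []}
        group = groups[value]
        group["numFound"] += 1              # counts every match in the group
        if len(group["docs"]) < group_limit:
            group["docs"].append(doc)       # but only the top docs are kept
    return list(groups.values())[:rows]


# Illustrative documents modelled on the response shown earlier.
docs = [
    {"id": "IW-02", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "manu_exact": "Belkin"},
    {"id": "MA147LL/A", "manu_exact": "Apple Computer Inc."},
]
grouped = group_by_field(docs, "manu_exact")
```

With the default `group_limit=1`, the Belkin group reports two matches but carries a single document, which mirrors the shape of the response on the "Group by Field" slide.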
FACETING
Pivot Faceting
  Other names that could have made sense:
  Grid Faceting, Cross-Product Faceting, Matrix Faceting
  Syntax: facet.pivot=field1,field2,field3,…
Example: facet.pivot=cat,inStock
                    #docs   #docs w/ inStock:true   #docs w/ inStock:false
cat:electronics     14      10                      4
cat:memory          3       3                       0
cat:connector       2       0                       2
cat:graphics card   2       0                       2
cat:hard drive      2       2                       0
Pivot Faceting
http://...&facet=true&facet.pivot=cat,popularity
"facet_counts":{
"facet_pivot":{
"cat,popularity":[{
"field":"cat",
"value":"electronics",
"count":14,
"pivot":[{
"field":"popularity",
"value":"6",
"count":5},
{
"field":"popularity",
"value":"7",
"count":4},
(continued)
{
"field":"popularity",
"value":"1",
"count":2}]},
{
"field":"cat",
"value":"memory",
"count":3,
"pivot":[]},
[…]
Callouts: "count":14 means 14 docs w/ cat==electronics; "count":5 means 5 docs w/ cat==electronics && popularity==6
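The nested counts can be reproduced with a small sketch. It assumes single-valued fields and a two-level pivot, and the sample documents are made up for illustration; it is not how Solr computes pivots internally.

```python
from collections import Counter, defaultdict


def pivot_counts(docs, outer, inner):
    """Count docs per outer value and, within each, per inner value."""
    outer_counts = Counter()
    inner_counts = defaultdict(Counter)
    for doc in docs:
        outer_value = doc[outer]
        outer_counts[outer_value] += 1
        inner_counts[outer_value][doc[inner]] += 1
    return [
        {
            "field": outer,
            "value": value,
            "count": count,
            "pivot": [
                {"field": inner, "value": v, "count": c}
                for v, c in inner_counts[value].most_common()
            ],
        }
        for value, count in outer_counts.most_common()
    ]


# Hypothetical documents, not the slide's example index.
sample = [
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": False},
    {"cat": "memory", "inStock": True},
]
pivot = pivot_counts(sample, "cat", "inStock")
```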
Range Faceting
  Like date faceting, but more generic
http://...&facet=true
&facet.range=price
&facet.range.start=0
&facet.range.end=500
&facet.range.gap=50
"facet_counts":{
"facet_ranges":{
"price":{
"counts":{
"0.0":5,
"50.0":2,
"100.0":0,
"150.0":2,
"200.0":0,
"250.0":1,
"300.0":2,
"350.0":2,
"400.0":0,
"450.0":1},
"gap":50.0,
"start":0.0,
"end":500.0}}}}
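A rough sketch of the bucketing behind `facet.range.start`, `facet.range.end` and `facet.range.gap`. It assumes each bucket includes its lower bound and excludes its upper bound, and that values outside the range are ignored; the prices used are illustrative.

```python
import math


def range_facet(values, start, end, gap):
    """Count values into [start + i*gap, start + (i+1)*gap) buckets."""
    bucket_count = int(math.ceil((end - start) / gap))
    # Keys are the lower bound of each bucket, formatted like the response.
    keys = [str(float(start + i * gap)) for i in range(bucket_count)]
    counts = {key: 0 for key in keys}
    for value in values:
        if start <= value < end:
            index = int((value - start) // gap)
            counts[keys[index]] += 1
    return counts


price_counts = range_facet([11.5, 19.95, 399.0], start=0, end=500, gap=50)
```

Empty buckets are kept in the result, as in the response above where several ranges report a count of 0.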
Existing single-valued faceting algorithm
[Figure: Lucene FieldCache Entry (StringIndex) for the "hero" field]
  order: for each doc, an index into the lookup array (5 3 5 1 4 5 2 1)
  lookup: the string values ((null), batman, flash, spiderman, superman, wolverine)
  For each of the documents matching the base query "Juggernaut": lookup its order value, then increment the matching accumulator slot
  The accumulator counts feed a priority queue (Batman, 3; flash, 5)
q=Juggernaut&facet=true&facet.field=hero
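The order/lookup structure and the accumulator step can be sketched as follows. The `order` and `lookup` arrays are taken from the figure; the base document set, the function name and the use of `heapq` as the priority queue are assumptions, so the resulting counts differ from those on the slide.

```python
import heapq

# From the figure: one lookup entry per distinct term, one order entry per doc.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]


def facet_counts(order, lookup, base_docset, limit=2):
    """Count terms for the docs in base_docset and return the top `limit`."""
    accumulator = [0] * len(lookup)          # one counter per term ordinal
    for doc_id in base_docset:
        accumulator[order[doc_id]] += 1      # lookup ordinal, then increment
    candidates = [
        (count, term)
        for term, count in zip(lookup, accumulator)
        if term is not None and count > 0
    ]
    top = heapq.nlargest(limit, candidates)  # priority-queue style selection
    return [(term, count) for count, term in top]


# Hypothetical set of doc ids matching the base query.
top_terms = facet_counts(order, lookup, base_docset=[0, 2, 3, 5, 7])
```

The inner loop touches only integer arrays, which is why this approach is fast once the FieldCache entry has been built.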
Per-segment single-valued algorithm
[Figure: Segment1 to Segment4, each with its own FieldCache Entry and its own accumulator (accumulator1 to accumulator4), processed by thread1 to thread4]
  Each thread walks the Base DocSet for its segment: lookup in the segment's FieldCache entry, then inc that segment's accumulator
  A FieldCache + accumulator merger (Priority queue) combines the per-segment counts into the final priority queue (Batman, 3; flash, 5)
Per-segment faceting
  Enable with facet.method=fcs
  Controllable multi-threading
facet.field={!threads=4}myfield
  Disadvantages
  Larger memory use (FieldCaches + accumulators)
  Slower (extra FieldCache merge step needed)
  Advantages
  Rebuilds FieldCache entries only for new segments (NRT friendly)
  Multi-threaded
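A simplified sketch of the per-segment idea: each segment has its own order/lookup arrays, counts are accumulated per segment (here on a thread pool), and the partial results are merged by term string because ordinals are not comparable across segments. The data layout and function names are assumptions, and a plain `Counter` stands in for the merging priority queue.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment, base_docs):
    """Accumulate term counts for one segment's matching docs."""
    order, lookup = segment["order"], segment["lookup"]
    accumulator = [0] * len(lookup)
    for doc_id in base_docs:
        accumulator[order[doc_id]] += 1
    # Translate ordinals back to terms so segments can be merged.
    return {lookup[i]: n for i, n in enumerate(accumulator) if n and lookup[i] is not None}


def facet_per_segment(segments, base_docsets, threads=4, limit=10):
    """Count each segment in parallel, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments, base_docsets))
    merged = Counter()
    for partial in partials:
        merged.update(partial)               # the extra merge step
    return merged.most_common(limit)


# Two hypothetical segments with different lookup arrays.
segments = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2]},
    {"lookup": [None, "flash", "superman"], "order": [1, 1, 2, 0]},
]
merged_counts = facet_per_segment(segments, base_docsets=[[0, 1, 2], [0, 1, 3]])
```

The merge over term strings is the extra cost listed under "Disadvantages"; the benefit is that only a new segment's arrays need rebuilding when the index changes.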
Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single valued field

A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
Time for request*        facet.method=fc   facet.method=fcs
static index             3 ms              244 ms
quickly changing index   1388 ms           267 ms

B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
Time for request*        facet.method=fc   facet.method=fcs
static index             26 ms             34 ms
quickly changing index   741 ms            94 ms

*complete request time, measured externally
Faceting Performance Improvements
  For facet.method=enum, speed up initial population of the filterCache (i.e. first-time facet): from 30% to 32x improvement
  Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
  Optimized deep facet paging – up to 10x faster with really large facet.offset values
  Less memory consumed by field cache entries
SCALABILITY
SolrCloud
  First steps toward simplifying cluster management
  Integrates ZooKeeper
  Central configuration (schema.xml, solrconfig.xml, etc)
  Tracks live nodes + shards of collections
  Removes need for external load balancers
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
  Can specify logical shard ids
shards=NY_shard,NJ_shard	
  
  Clients don’t need to know shards at all:
http://localhost:8983/solr/collection1/select?distrib=true	
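The structure of the `shards` value above can be read as: commas separate shards, and `|` separates equivalent servers for one shard. A minimal sketch of parsing it and choosing one server per shard follows; the random choice is only an assumption about how load might be spread, and the function names are hypothetical.

```python
import random


def parse_shards(shards_param):
    """Split a shards value into a list of replica lists, one per shard."""
    return [shard.split("|") for shard in shards_param.split(",")]


def pick_replicas(shards_param, rng=random):
    """Choose one server per shard (here: uniformly at random)."""
    return [rng.choice(replicas) for replicas in parse_shards(shards_param)]


shards = "localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr"
layout = parse_shards(shards)
chosen = pick_replicas(shards)
```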
  
SolrCloud : The Future
  Eliminate all single points of failure
  Remove Master/Searcher distinction
  Enables near real-time search in a highly scalable environment
  High Availability for Writes
  Eventual consistency model (like Amazon Dynamo, Cassandra)
  Elastic
  Simply add/subtract servers, cluster will rebalance automatically
  By default, Solr will handle document partitioning
ODDS & ENDS
Auto-Suggest
  Many people currently use terms component
  Can be slow for a large corpus
  New auto-suggest builds off SpellCheck component
  Compact memory based trie for really fast completions
  Based on a field in the main index, or on a dictionary file
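To illustrate why a trie suits prefix completion, here is a small uncompressed trie. It is a teaching sketch only and makes no claim about the data structure Solr's suggester actually uses; the class, its methods and the word list are hypothetical.

```python
class Trie:
    """A plain character trie supporting prefix completion."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`, in alphabetical order."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            # Push in reverse so the smallest character is popped first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


trie = Trie()
for word in ["ultrasharp", "ultra", "usb"]:
    trie.insert(word)
completions = trie.complete("ult")
```

Lookup cost depends on the prefix length and the number of completions returned rather than on the size of the term dictionary.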
http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
"spellcheck":{
"suggestions":[
"ult",{
"numFound":1,
"startOffset":0,
"endOffset":3,
"suggestion":["ultrasharp"]},
"collation","ultrasharp"]}}
Index with JSON
$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
{
"add": {
  "doc": {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "title" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
}
}'
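The same request can be prepared from Python with the standard library. This sketch only builds the request object and does not send it, so it runs without a server; the helper name `build_add_request` is an assumption, while the URL, header and payload shape follow the curl example.

```python
import json
import urllib.request

UPDATE_URL = "http://localhost:8983/solr/update/json"


def build_add_request(doc, url=UPDATE_URL):
    """Wrap a document in the {"add": {"doc": ...}} envelope and prepare a POST."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,  # supplying data makes this a POST
        headers={"Content-type": "application/json"},
    )


doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}
request = build_add_request(doc)
# urllib.request.urlopen(request) would send it to a running Solr instance.
```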
Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
  Can handle multi-valued fields (see “cat” field in example)
  Completely compatible with the CSV update handler (can round-trip)
  Results are streamed – good for dumping entire parts of the index
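Reading such a response back on the client is straightforward with Python's `csv` module. The sketch below assumes multi-valued fields arrive as a single quoted cell with comma-separated values, as in the sample above, and that individual values contain no commas of their own; the function name is hypothetical.

```python
import csv
import io

response_text = '''name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
'''


def parse_solr_csv(text, multi_valued=("cat",)):
    """Parse a CSV response into dicts, splitting the listed multi-valued fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            row[field] = row[field].split(",")
        rows.append(row)
    return rows


rows = parse_solr_csv(response_text)
```

All other values stay as strings; converting `price` or `popularity` to numbers is left to the caller.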
http://localhost:8983/solr/browse
Q&A
For more information about Solr visit
www.lucidimagination.com

Solr 3.1 and beyond

  • 1. Solr 3.1 and Beyond yonik@lucidimagination.com October 8, 2010 2 Lucid Imagination Yonik Seeley
  • 2. Agenda Goal : Introduce new features you can try & use now in Solr development versions 3.1 or 4.0   Relevancy (Extended Dismax Parser)   Spatial/Geo Search   Search Result Grouping / Field Collapsing   Faceting (Pivot, Range, Per-segment)   Scalability (Solr Cloud)   Odds & Ends   Q&A 10/12/10 3
  • 3. Solr 3.1? What happened to 1.5?   Lucene/Solr merged (March 2010)   Single set of committers   Single dev mailing list (dev@lucene.apache.org)   Single shared subversion trunk   Keep separate downloads, user mailing lists   Other former lucene subprojects spun off (Nutch, Tika, Mahout, etc)   Development   trunk is now always next major release (currently 4.0)   branch_3x will be base for all 3.x releases   Branch together, Release together, Share version numbers
  • 5. Extended Dismax Parser   Superset of dismax &defType=edismax&q=foo&qf=body     Fixes edge cases where dismax could still throw exceptions OR      AND      NOT      -­‐      “     Full lucene syntax support   Tries lucene syntax first   Smart escaping is done if syntax errors   Optionally supports treating “and”/”or” as AND/OR in lucene syntax   Fielded queries (e.g. myfield:foo) even in degraded mode   uf parameter controls what field names may be directly specified in “q”
  • 6. Extended Dismax Parser (continued)   boost parameter for multiplicative boost-by-function   Pure negative query clauses Example: solr  OR  (-­‐solr)     Enhanced term proximity boosting   pf2=myfield – results in term bigrams in sloppy phrase queries  myfield:“aa  bb  cc”    -­‐>    myfield:“aa  bb”    myfield:“bb  cc”     Enhanced stopword handling   stopwords omitted in main query, but added in optional proximity boosting part Example: q=solr  is  awesome  &  qf=myfield  &  pf2=myfield      -­‐>          +myfield:(solr  awesome)    (myfield:”solr  is”  myfield:”is   awesome”)     Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
  • 8. Spatial Search 10/12/10 9 Step1: Index some locations! <field name=“name”>The Alpine Shop</field> <field name=“store”>44.013617,-73.168264</field> Step2: Decide where you are &pt=44.0153371,-73.16734 &d=1 &sfield=store Step3: Profit! Spatial Filter: &fq={!geofilt} Bounding Box: &fq={!bbox} Distance Function: &sort=geodist() asc
  • 10. Field Collapsing Definition  Field collapsing   Limit the number of results per category   “category” normally defined by unique values in a field  Uses   Web Search – collapse by web site   Email threads – collapse by thread id   Ecommerce/retail   Show the top 5 items for each store category (music, movies, etc)
  • 12. Field Collapse on Product Type Result Grouping by Category
  • 13. Group by Field http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact 10/12/10 14 "grouped":{ "manu_exact":{ "matches":3, "groups":[{ "groupValue":"Belkin", "doclist":{"numFound":2,"start":0,"docs":[ { "id":"IW-02", "name":"iPod & iPod Mini USB 2.0 Cable"}] }}, { "groupValue":"Apple Computer Inc.", "doclist":{"numFound":1,"start":0,"docs":[ { "id":"MA147LL/A",
  • 14. Group by Query 10/12/10 15 http://...&group=true&group.query=price:[0 TO 99.99] &group.query=price:[100 TO *]&group.limit=5 "grouped":{ "price:[0 TO 99.99]":{ "matches":3, "doclist":{"numFound":2,"start":0,"docs":[ { "id":"IW-02", "name":"iPod & iPod Mini USB 2.0 Cable"}, { "id":"F8V7067-APL-KIT", "name":"Belkin Mobile Power Cord for iPod"}] }}, "price:[100 TO *]":{ "matches":3, "doclist":{"numFound":1,"start":0,"docs":[ {
  • 15. Grouping Params parameter meaning default group.field=<field> Like facet.field – group by unique field values group.query=<query> Like facet.query – top docs that also match group.function=<function query> Group by unique values produced by the function query group.limit=<n> How many docs per group 1 group.sort=<sort spec> How to sort documents within a group Same as “sort” param rows=<n> How many groups to return 10 sort=<sort spec> How to sort the groups relative to each other (based on top doc) 10/12/10 16
  • 17. Pivot Faceting   Other names that could have made sense:   Grid Faceting, Cross-Product Faceting, Matrix Faceting   Syntax: facet.pivot=field1,field2,field3,… 10/12/10 18 #docs #docs w/ inStock:true #docs w/ instock:false cat:electronics 14 10 4 cat:memory 3 3 0 cat:connector 2 0 2 cat:graphics card 2 0 2 cat:hard drive 2 2 0 facet.pivot=cat,inStock
  • 19. Range Faceting •  Like Date faceting, but more generic http://...&facet=true &facet.range=price &facet.range.start=0 &facet.range.end=500 &facet.range.gap=50 "facet_counts":{ "facet_ranges":{ "price":{ "counts":{ "0.0":5, "50.0":2, "100.0":0, "150.0":2, "200.0":0, "250.0":1, "300.0":2, "350.0":2, "400.0":0, "450.0":1}, "gap":50.0, "start":0.0, "end":500.0}}}} 10/12/10 20
  • 20. 5 3 5 1 4 5 2 1 (null) batman flash spiderman superman wolverine order: for each doc, an index into the lookup array lookup: the string values Lucene FieldCache Entry (StringIndex) for the “hero” field 0 2 7 0 1 0 0 0 2 Documents matching the base query “Juggernaut” accumulator increment lookup q=Juggernaut &facet=true &facet.field=hero Priority queue Batman, 3 flash, 5 Existing single-valued faceting algorithm
  • 21. Per-segment single-valued algorithm
    [Diagram: Segment1 to Segment4, each with its own FieldCache entry and its own accumulator (accumulator1 to accumulator4), processed by thread1 to thread4]
    - Each thread looks up the Base DocSet documents in its segment's FieldCache entry and increments its own accumulator
    - FieldCache + accumulator merger (priority queue) combines the per-segment results
    - Priority queue: Batman, 3; flash, 5
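A simplified Python model of the per-segment variant follows. Each segment has its own lookup and order arrays, so ordinals are not comparable across segments and the merge has to go by term string. The two tiny segments and all function names are assumptions, and a `Counter` is used for the merge step where the slide shows a priority-queue merger.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment, base_docs):
    """Count terms within one segment using its own lookup/order arrays."""
    lookup, order = segment["lookup"], segment["order"]
    accumulator = [0] * len(lookup)
    for doc_id in base_docs:  # doc ids are local to the segment
        accumulator[order[doc_id]] += 1
    return {t: n for t, n in zip(lookup[1:], accumulator[1:]) if n}


def per_segment_facets(segments, base_docs_per_segment, limit, threads=4):
    """Count each segment in its own task, then merge counts by term."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments, base_docs_per_segment))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds counts for terms seen in several segments
    return merged.most_common(limit)


# Hypothetical segments; "flash" has a different ordinal in each one.
segments = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2, 0]},
    {"lookup": [None, "flash", "superman"], "order": [1, 1, 2]},
]
base_docs_per_segment = [[0, 1, 2], [0, 1, 2]]
print(per_segment_facets(segments, base_docs_per_segment, limit=1))
```

The extra merge step is the cost noted on the next slide; the benefit is that a new segment only needs its own small arrays built.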
  • 22. Per-segment faceting
    - Enable with facet.method=fcs
    - Controllable multi-threading: facet.field={!threads=4}myfield
    - Disadvantages
      - Larger memory use (FieldCaches + accumulators)
      - Slower (extra FieldCache merge step needed)
    - Advantages
      - Rebuilds FieldCache entries only for new segments (NRT friendly)
      - Multi-threaded
  • 23. Per-segment faceting performance comparison
    Test index: 10M documents, 18 segments, single valued field
    A. Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
      Time for request*        facet.method=fc   facet.method=fcs
      static index             3 ms              244 ms
      quickly changing index   1388 ms           267 ms
    B. Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
      Time for request*        facet.method=fc   facet.method=fcs
      static index             26 ms             34 ms
      quickly changing index   741 ms            94 ms
    *complete request time, measured externally
  • 24. Faceting Performance Improvements
    - For facet.method=enum, faster initial population of the filterCache (i.e. the first facet request): improvements range from 30% to 32x
    - Optimized facet.method=fc for multi-valued fields and large facet.limit: up to 3x faster
    - Optimized deep facet paging: up to 10x faster with really large facet.offset values
    - Less memory consumed by field cache entries
  • 26. SolrCloud
    - First steps toward simplifying cluster management
    - Integrates ZooKeeper
    - Central configuration (schema.xml, solrconfig.xml, etc)
    - Tracks live nodes + shards of collections
    - Removes need for external load balancers
      shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
    - Can specify logical shard ids
      shards=NY_shard,NJ_shard
    - Clients don't need to know shards at all:
      http://localhost:8983/solr/collection1/select?distrib=true
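To illustrate how the first shards value above can replace an external load balancer, here is a small Python sketch that reads it as "commas separate shards, pipes separate equivalent replicas of one shard" and picks one replica per shard. That reading is an interpretation of the example, and `parse_shards` and `pick_replicas` are hypothetical helpers, not part of Solr.

```python
import random


def parse_shards(shards_param):
    """Split a shards value into a list of shards, each a list of replicas."""
    return [
        [replica.strip() for replica in shard.split("|")]
        for shard in shards_param.split(",")
    ]


def pick_replicas(shards, rng=random):
    """Choose one replica per shard, e.g. for simple client-side balancing."""
    return [rng.choice(replicas) for replicas in shards]


shards = parse_shards(
    "localhost:8983/solr|localhost:8900/solr,"
    "localhost:7574/solr|localhost:7500/solr"
)
print(shards)
print(pick_replicas(shards))
```

A request would then fan out to one address per shard and merge the results.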
  • 27. SolrCloud: The Future
    - Eliminate all single points of failure
    - Remove Master/Searcher distinction
    - Enables near real-time search in a highly scalable environment
    - High Availability for Writes
    - Eventual consistency model (like Amazon Dynamo, Cassandra)
    - Elastic
      - Simply add/subtract servers, cluster will rebalance automatically
    - By default, Solr will handle document partitioning
  • 29. Auto-Suggest
    - Many people currently use terms component
    - Can be slow for a large corpus
    - New auto-suggest builds off SpellCheck component
    - Compact memory based trie for really fast completions
    - Based on a field in the main index, or on a dictionary file
    http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
    "spellcheck":{
      "suggestions":[
        "ult",{
          "numFound":1,
          "startOffset":0,
          "endOffset":3,
          "suggestion":["ultrasharp"]},
        "collation","ultrasharp"]}}
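For readers unfamiliar with tries, the sketch below shows why prefix completion is fast: the lookup walks one node per typed character and then enumerates the subtree. It is a plain dictionary-based trie in Python, not the compact structure the slide refers to, and every word except "ultrasharp" is made up.

```python
class Trie:
    """Minimal prefix tree; each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            # Push in reverse order so words come out alphabetically.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


trie = Trie()
for word in ["ultrasharp", "ultra", "usb", "video"]:  # mostly hypothetical terms
    trie.insert(word)
print(trie.complete("ult"))
```

The cost of a lookup depends on the prefix length and the number of completions returned, not on the size of the corpus.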
  • 30. Index with JSON
    $ URL=http://localhost:8983/solr/update/json
    $ curl $URL -H 'Content-type:application/json' -d '
    {
      "add": {
        "doc": {
          "id" : "978-0641723445",
          "cat" : ["book","hardcover"],
          "title" : "The Lightning Thief",
          "author" : "Rick Riordan",
          "series_t" : "Percy Jackson and the Olympians",
          "sequence_i" : 1,
          "genre_s" : "fantasy",
          "inStock" : true,
          "price" : 12.50,
          "pages_i" : 384
        }
      }
    }'
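The same update can be prepared from Python with only the standard library. This is a sketch equivalent to the curl command above, assuming a Solr instance at the URL from the slide; the request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

URL = "http://localhost:8983/solr/update/json"  # endpoint from the slide

doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}

# Wrap the document in the same "add" envelope the curl example uses.
payload = json.dumps({"add": {"doc": doc}}).encode("utf-8")
request = urllib.request.Request(
    URL,
    data=payload,
    headers={"Content-type": "application/json"},
    method="POST",
)

# To actually send it (requires a running Solr):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```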
  • 31. Query Results in CSV
    http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
    name,price,cat,popularity
    iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
    Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
    Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
    - Can handle multi-valued fields (see "cat" field in example)
    - Completely compatible with the CSV update handler (can round-trip)
    - Results are streamed: good for dumping entire parts of the index
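As a small illustration of consuming this output, the Python sketch below parses the three rows from the slide with the standard csv module and splits the quoted, comma-joined "cat" field back into a list. Splitting on a comma matches the example shown; whether that separator is configurable is not covered here.

```python
import csv
import io

# The CSV response shown on the slide, pasted as a string.
csv_text = """name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # The multi-valued field arrives as one quoted cell; split it into values.
    row["cat"] = row["cat"].split(",")
    row["price"] = float(row["price"])

for row in rows:
    print(row["name"], row["price"], row["cat"])
```

For a large dump, the same reader could wrap the HTTP response stream line by line instead of a string.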
  • 33. Q&A For more information about Solr visit www.lucidimagination.com