Solr 3.1 and beyond

Yonik Seeley, Lucid Imagination
yonik@lucidimagination.com
October 8, 2010

Agenda
Goal: Introduce new features you can try and use now in Solr development versions 3.1 or 4.0
  Relevancy (Extended Dismax Parser)
  Spatial/Geo Search
  Search Result Grouping / Field Collapsing
  Faceting (Pivot, Range, Per-segment)
  Scalability (Solr Cloud)
  Odds & Ends
  Q&A
10/12/10 3

Solr 3.1? What happened to 1.5?
  Lucene/Solr merged (March 2010)
  Single set of committers
  Single dev mailing list (dev@lucene.apache.org)
  Single shared subversion trunk
  Keep separate downloads, user mailing lists
  Other former lucene subprojects spun off (Nutch, Tika, Mahout, etc)
  Development
  trunk is now always next major release (currently 4.0)
  branch_3x will be base for all 3.x releases
  Branch together, Release together, Share version numbers

Extended Dismax Parser
  Superset of dismax
&defType=edismax&q=foo&qf=body

  Fixes edge cases where dismax could still throw exceptions (OR, AND, NOT, -, ")
  Full lucene syntax support
  Tries lucene syntax first
  Falls back to smart escaping if there are syntax errors
  Optionally supports treating "and"/"or" as AND/OR in lucene syntax
  Fielded queries (e.g. myfield:foo) work even in degraded mode
  uf parameter controls which field names may be specified directly in "q"
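
A minimal sketch of how a client might assemble such a request. Only defType, q, qf and uf come from the slides; the endpoint, the helper name and the sample query are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host and path to your own installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def edismax_url(user_query, query_fields, allowed_user_fields=None):
    """Build a select URL that routes the user query through edismax."""
    params = [
        ("defType", "edismax"),
        ("q", user_query),
        ("qf", " ".join(query_fields)),
    ]
    if allowed_user_fields:
        # uf restricts which field names a user may address directly in q.
        params.append(("uf", " ".join(allowed_user_fields)))
    return SOLR_SELECT + "?" + urlencode(params)


url = edismax_url("myfield:foo OR bar", ["body"], ["myfield"])
print(url)
```

Passing the raw user input straight into q is the point of the parser: the client does not need to pre-escape it.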

Extended Dismax Parser (continued)
  boost parameter for multiplicative boost-by-function
  Pure negative query clauses
Example: solr OR (-solr)

  Enhanced term proximity boosting
  pf2=myfield – results in term bigrams in sloppy phrase queries

myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"

  Enhanced stopword handling
  stopwords omitted in main query, but added in optional proximity boosting part
Example: q=solr is awesome & qf=myfield & pf2=myfield
  -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
  Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
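
The two rewrites above can be imitated in a few lines. This is only an illustration of the resulting query shape, not the parser's actual implementation; the function names and the one-word stopword list are assumptions.

```python
STOPWORDS = {"is"}  # assumed stopword list, just enough for the example


def main_clause(field, terms, stopwords=STOPWORDS):
    """Mandatory part of the query: stopwords are left out."""
    kept = [t for t in terms if t not in stopwords]
    return "+%s:(%s)" % (field, " ".join(kept))


def pf2_phrases(field, terms):
    """Optional boosting part: one phrase per adjacent pair of terms,
    stopwords included."""
    return ['%s:"%s %s"' % (field, a, b) for a, b in zip(terms, terms[1:])]


terms = "solr is awesome".split()
print(main_clause("myfield", terms))
print(" ".join(pf2_phrases("myfield", terms)))
```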

Spatial Search
Step 1: Index some locations!
<field name="name">The Alpine Shop</field>
<field name="store">44.013617,-73.168264</field>
Step 2: Decide where you are
&pt=44.0153371,-73.16734
&d=1
&sfield=store
Step 3: Profit!
Spatial Filter: &fq={!geofilt}
Bounding Box: &fq={!bbox}
Distance Function: &sort=geodist() asc
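
A rough sketch of what the three steps amount to. It assumes that d is measured in kilometres and that distance is a great-circle (haversine) distance; the match-all q=*:* and the Earth radius constant are also assumptions, not taken from the slides.

```python
import math
from urllib.parse import urlencode

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)   # "The Alpine Shop" from step 1
here = (44.0153371, -73.16734)    # the pt parameter from step 2
distance = haversine_km(store[0], store[1], here[0], here[1])

# Request parameters combining the filter and the distance sort from step 3.
query = urlencode([
    ("q", "*:*"),
    ("pt", "%s,%s" % here),
    ("d", "1"),
    ("sfield", "store"),
    ("fq", "{!geofilt}"),
    ("sort", "geodist() asc"),
])
print(round(distance, 3), query)
```

Under these assumptions the shop lies well inside the d=1 radius, so the filter would keep it.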

RESULT GROUPING / FIELD COLLAPSING

Field Collapsing Definition
 Field collapsing
  Limit the number of results per category
  “category” normally defined by unique values in a field
 Uses
  Web Search – collapse by web site
  Email threads – collapse by thread id
  Ecommerce/retail
  Show the top 5 items for each store category (music, movies, etc.)

Field Collapse on Product Type
Result Grouping by Category

Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            {
              "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      {
        "groupValue":"Apple Computer Inc.",
        {
          "id":"MA147LL/A",
Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5
"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    {
      "id":"IW-02",
      "name":"iPod & iPod Mini USB 2.0 Cable"},
    {
      "id":"F8V7067-APL-KIT",
      "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    {
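
On the client side, a group.field response can be reduced to a simple mapping. The sketch below uses a hand-written sample in the shape of the (truncated) group.field response shown earlier; the second group's doclist is invented to make the sample complete, and the helper name is hypothetical.

```python
import json

# Hand-completed sample: the slide cuts off the second group, so its
# doclist and name here are placeholders, not real output.
SAMPLE = json.loads("""
{"grouped": {"manu_exact": {"matches": 3, "groups": [
  {"groupValue": "Belkin",
   "doclist": {"numFound": 2, "start": 0, "docs": [
     {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable"}]}},
  {"groupValue": "Apple Computer Inc.",
   "doclist": {"numFound": 1, "start": 0, "docs": [
     {"id": "MA147LL/A", "name": "(elided in the slide)"}]}}
]}}}
""")


def top_ids_per_group(response, field):
    """Map each group value to the ids of the docs returned for it."""
    groups = response["grouped"][field]["groups"]
    return {g["groupValue"]: [d["id"] for d in g["doclist"]["docs"]] for g in groups}


result = top_ids_per_group(SAMPLE, "manu_exact")
print(result)
```

Note that numFound can exceed the number of docs listed, because group.limit caps how many docs are returned per group.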

Grouping Params
parameter                         meaning                                                            default
group.field=<field>               Like facet.field – group by unique field values
group.query=<query>               Like facet.query – top docs that also match
group.function=<function query>   Group by unique values produced by the function query
group.limit=<n>                   How many docs per group                                            1
group.sort=<sort spec>            How to sort documents within a group                               Same as "sort" param
rows=<n>                          How many groups to return                                          10
sort=<sort spec>                  How to sort the groups relative to each other (based on top doc)

Pivot Faceting
  Other names that could have made sense:
  Grid Faceting, Cross-Product Faceting, Matrix Faceting
  Syntax: facet.pivot=field1,field2,field3,…
                     #docs   #docs w/ inStock:true   #docs w/ inStock:false
cat:electronics      14      10                      4
cat:memory           3       3                       0
cat:connector        2       0                       2
cat:graphics card    2       0                       2
cat:hard drive       2       2                       0
facet.pivot=cat,inStock

Pivot Faceting
http://...&facet=true&facet.pivot=cat,popularity
"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},
          {
            "value":"7",
            "count":4},
(continued)
          {
            "value":"1",
            "count":2}]},
      {
        "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
      […]
"count":14 -> 14 docs w/ cat==electronics
"count":5 -> 5 docs w/ cat==electronics && popularity==6
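
One way to read a nested pivot result is to flatten it into (path, count) rows. The sketch below works on a small hand-written sample in the shape of the response above; the sample data and the function name are assumptions.

```python
# Hand-written sample following the structure of the facet_pivot response.
SAMPLE_PIVOT = [
    {"field": "cat", "value": "electronics", "count": 14, "pivot": [
        {"field": "popularity", "value": "6", "count": 5},
        {"field": "popularity", "value": "7", "count": 4},
    ]},
    {"field": "cat", "value": "memory", "count": 3, "pivot": []},
]


def flatten_pivot(entries, path=()):
    """Depth-first walk that yields one (path, count) row per pivot node."""
    rows = []
    for entry in entries:
        here = path + ((entry["field"], entry["value"]),)
        rows.append((here, entry["count"]))
        rows.extend(flatten_pivot(entry.get("pivot", []), here))
    return rows


rows = flatten_pivot(SAMPLE_PIVOT)
for path, count in rows:
    print(" && ".join("%s==%s" % fv for fv in path), "->", count)
```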

Range Faceting
  Like Date faceting, but more generic
http://...&facet=true&facet.range=price&facet.range.start=0&facet.range.end=500&facet.range.gap=50
"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}
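
The counts above are keyed by the lower bound of each bucket. A small sketch of that bucketing, assuming each range includes its lower bound and excludes its upper bound; the price list is made up and the function is illustrative, not Solr code.

```python
def range_facet(values, start, end, gap):
    """Count values per [lower, lower + gap) bucket between start and end."""
    counts = {}
    lower = start
    while lower < end:
        counts[float(lower)] = 0
        lower += gap
    for v in values:
        if start <= v < end:
            bucket = start + gap * int((v - start) // gap)
            counts[float(bucket)] += 1
    return counts


prices = [11.5, 19.95, 74.99, 185.0, 399.0]  # made-up sample values
counts = range_facet(prices, 0, 500, 50)
print(counts)
```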

Existing single-valued faceting algorithm
  Request: q=Juggernaut&facet=true&facet.field=hero
  Lucene FieldCache Entry (StringIndex) for the "hero" field
    order: for each doc, an index into the lookup array (5, 3, 5, 1, 4, 5, 2, 1)
    lookup: the string values ((null), batman, flash, spiderman, superman, wolverine)
  For each document matching the base query "Juggernaut": look up its order value and increment that slot of the accumulator
  Priority queue holds the top terms (e.g. Batman, 3; flash, 5)
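
The diagram's idea can be sketched as follows. The order and lookup arrays are the ones from the figure; the set of matching documents is hypothetical, and this is a simplified illustration rather than Solr's actual code.

```python
import heapq

# lookup: sorted string values, slot 0 standing for "no value".
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
# order: for each doc id, an index into lookup.
order = [5, 3, 5, 1, 4, 5, 2, 1]
# Hypothetical doc ids matching the base query.
base_docs = [0, 1, 2, 5, 6]


def facet_counts(order, lookup, base_docs, limit):
    """Count terms over the matching docs and return the top `limit`."""
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1
    # Skip slot 0 (docs without a value) and terms that never matched.
    candidates = (
        (count, term)
        for term, count in zip(lookup[1:], accumulator[1:])
        if count > 0
    )
    return [(term, count) for count, term in heapq.nlargest(limit, candidates)]


top = facet_counts(order, lookup, base_docs, limit=2)
print(top)
```

The accumulator is indexed by ordinal, so counting is one array increment per matching document; strings are only touched when the top terms are reported.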

Per-segment single-valued algorithm
  Each segment (Segment1 to Segment4) has its own FieldCache Entry and its own accumulator (accumulator1 to accumulator4)
  Docs from the Base DocSet are looked up and counted (lookup, inc) per segment, one thread per segment (thread1 to thread4)
  A FieldCache + accumulator merger (Priority queue) combines the per-segment counts (e.g. Batman, 3; flash, 5)

Per-segment faceting
  Enable with facet.method=fcs
  Controllable multi-threading
facet.field={!threads=4}myfield

  Disadvantages
  Larger memory use (FieldCaches + accumulators)
  Slower (extra FieldCache merge step needed)
  Advantages
  Rebuilds FieldCache entries only for new segments (NRT friendly)
  Multi-threaded
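
A simplified sketch of the per-segment approach, showing where the extra merge step comes from: ordinals are only meaningful within one segment, so per-segment counts have to be combined by term. The segment data, the matching docs and the function names are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Each hypothetical segment carries its own lookup/order arrays.
SEGMENTS = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2, 0]},
    {"lookup": [None, "flash", "superman"], "order": [1, 1, 2]},
]
# Matching doc ids, expressed per segment.
BASE_DOCS = [[0, 1, 2], [0, 2]]


def count_segment(segment, docs):
    """Count terms inside one segment using its own ordinals."""
    acc = [0] * len(segment["lookup"])
    for doc in docs:
        acc[segment["order"][doc]] += 1
    return Counter({t: c for t, c in zip(segment["lookup"][1:], acc[1:]) if c})


def facet_per_segment(segments, base_docs, threads=4):
    """Count each segment on its own worker, then merge by term string."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments, base_docs))
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged


merged = facet_per_segment(SEGMENTS, BASE_DOCS)
print(merged.most_common(2))
```

When a new segment appears, only that segment's arrays need to be built; the others are reused, which is the NRT-friendly property listed above.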

Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single valued field

A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms
Time for request*         facet.method=fc   facet.method=fcs
static index              3 ms              244 ms
quickly changing index    1388 ms           267 ms

B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms
Time for request*         facet.method=fc   facet.method=fcs
static index              26 ms             34 ms
quickly changing index    741 ms            94 ms

*complete request time, measured externally

Faceting Performance Improvements
  For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement
  Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
  Optimized deep facet paging – up to 10x faster with really large facet.offsets
  Less memory consumed by field cache entries

SolrCloud
  First steps toward simplifying cluster management
  Integrates Zookeeper
  Central configuration (schema.xml, solrconfig.xml, etc)
  Tracks live nodes + shards of collections
  Removes need for external load balancers
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr

  Can specify logical shard ids
shards=NY_shard,NJ_shard

  Clients don’t need to know shards at all:
http://localhost:8983/solr/collection1/select?distrib=true
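
A small sketch of how the shards value above can be read, assuming a comma separates shards and "|" separates interchangeable servers for the same shard. The function names and the pluggable chooser are illustrative, not part of Solr.

```python
import random


def parse_shards(shards_param):
    """Split a shards value into a list of shards, each a list of servers."""
    return [group.split("|") for group in shards_param.split(",") if group]


def pick_servers(shards, chooser=random.choice):
    """Pick one server per shard, which is what spreads the load."""
    return [chooser(servers) for servers in shards]


shards = parse_shards(
    "localhost:8983/solr|localhost:8900/solr,"
    "localhost:7574/solr|localhost:7500/solr"
)
print(shards)
print(pick_servers(shards))
```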

SolrCloud : The Future
  Eliminate all single points of failure
  Remove Master/Searcher distinction
  Enables near real-time search in a highly scalable environment
  High Availability for Writes
  Eventual consistency model (like Amazon Dynamo, Cassandra)
  Elastic
  Simply add/subtract servers, cluster will rebalance automatically
  By default, Solr will handle document partitioning

Auto-Suggest
  Many people currently use terms component
  Can be slow for a large corpus
  New auto-suggest builds off SpellCheck component
  Compact memory based trie for really fast completions
  Based on a field in the main index, or on a dictionary file
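
To illustrate why a trie makes completions fast, here is a plain dictionary-based trie. It is a sketch of the general data structure only, not the compact implementation Solr ships; the word list is made up.

```python
class Trie:
    """Minimal prefix tree: one node per character."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            # Push in reverse order so words come out alphabetically.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


trie = Trie()
for word in ["ultrasharp", "ultimate", "usb"]:  # hypothetical dictionary
    trie.insert(word)
print(trie.complete("ult"))
```

Lookup cost depends on the prefix length and the number of completions returned, not on the size of the corpus.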
http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}}

Index with JSON
$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
{
  "add": {
    "doc": {
      "id" : "978-0641723445",
      "cat" : ["book","hardcover"],
      "title" : "The Lightning Thief",
      "author" : "Rick Riordan",
      "series_t" : "Percy Jackson and the Olympians",
      "sequence_i" : 1,
      "genre_s" : "fantasy",
      "inStock" : true,
      "price" : 12.50,
      "pages_i" : 384
    }
  }
}'
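
The same request can be prepared from a script. This sketch uses only the Python standard library and builds the request without sending it, so it runs without a Solr server; it assumes the endpoint and content type shown in the curl command.

```python
import json
import urllib.request

URL = "http://localhost:8983/solr/update/json"

doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}

# Wrap the document in the "add" command, as in the curl example.
payload = json.dumps({"add": {"doc": doc}}).encode("utf-8")
request = urllib.request.Request(
    URL, data=payload, headers={"Content-type": "application/json"}
)

# Uncomment to actually send the update to a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
print(request.get_method(), request.full_url)
```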


Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
  Can handle multi-valued fields (see “cat” field in example)
  Completely compatible with the CSV update handler (can round-trip)
  Results are streamed – good for dumping entire parts of the index
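
Reading this output back on the client is straightforward with a CSV parser. The sketch below assumes that multi-valued fields arrive as one quoted, comma-separated cell (as "cat" does above) and that individual values contain no embedded commas; the function name is hypothetical.

```python
import csv
import io

CSV_TEXT = """name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
"""


def parse_rows(text, multi_valued=("cat",)):
    """Parse CSV results, splitting multi-valued cells into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            row[field] = row[field].split(",")
        rows.append(row)
    return rows


rows = parse_rows(CSV_TEXT)
for row in rows:
    print(row["name"], row["cat"])
```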

http://localhost:8983/solr/browse

Q&A
For more information about Solr visit
www.lucidimagination.com
