Solr search engine with multiple table relation

Powerful Full-Text Search
with Solr
Jay Bharat
jay@carmatec.com
Carmatec It solution, Bangalore
1 July 2013

An introduction to Solr
Implementing search with free
software

What is Solr?
•  Solr is an open source enterprise search
server based on the Lucene Java search
library.
•  Solr runs in a Java servlet container such
as Tomcat or Jetty
•  Solr is free software and a project of the
Apache Software Foundation
•  Solr is a sub-project of Lucene and can be
found at http://lucene.apache.org/solr/

Key Features
•  Advanced Full-Text search
•  Optimized for High Volume Web Traffic
•  Standards Based Open Interfaces – XML and
HTTP
•  Comprehensive HTML Administration Interface
•  Server statistics exposed over JMX for monitoring
•  Scalability through efficient replication
•  Flexibility with XML configuration and Plugins
•  Push vs Crawl indexing method

Solr Clients
•  Solr can be integrated with, among others…
–  Ruby
–  PHP
–  Java
–  Python
–  JSON
–  Forrest/Cocoon
–  C# (Deveel Solr Client or SolrNet)
–  ColdFusion
–  Drupal (apachesolr project for Drupal)

Indexing
•  Push vs Crawl
•  Schema.xml
•  Add documents
•  HTML interface
–  Update
–  Delete
–  Commit

•  DataImportHandler
–  For searching databases

Searching
•  Full text search
http://localhost:8983/solr/select?q=Iraq
•  Search only within a field
http://localhost:8983/solr/select?q=category:news
•  Control which fields are displayed in result
q=video&fl=id,category
•  Provide ranges to fields
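
The query URLs on this slide can also be assembled programmatically. Here is a minimal Python sketch, assuming the localhost URL used in the slides; build_select_url is a hypothetical helper written for illustration, not part of any Solr client library:

```python
from urllib.parse import urlencode

# Base URL taken from the slides; adjust host, port and path for your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_select_url(query, fields=None, **extra):
    """Return a /select URL with the parameters percent-encoded."""
    params = {"q": query}
    if fields:
        # fl takes a comma-separated list of field names.
        params["fl"] = ",".join(fields)
    params.update(extra)
    return SOLR_SELECT + "?" + urlencode(params)


full_text_url = build_select_url("Iraq")
field_url = build_select_url("category:news")
fields_url = build_select_url("video", fields=["id", "category"])
print(fields_url)
```

Encoding the parameters matters once queries contain spaces, colons or brackets, which is why the sketch does not concatenate strings by hand.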

More Searching
•  Faceting information
q=news&fl=id,description&facet=true&facet.field=category
•  More like this (MLT)
q=Iraq&mlt=true&mlt.fl=headline&mlt.mindf=1&mlt.mintf=1&fl=id,score&rows=100
•  More information on how this works and the
options available can be found at
http://wiki.apache.org/solr/MoreLikeThis

QueryResponseWriter
•  A QueryResponseWriter is a Solr plugin
that defines the response format for any
request
•  All of the requests we have made so far
are formatted with the
XMLResponseWriter
•  Other formats can be applied by
appending wt=format to the search string
like this:
http://localhost:8983/solr/select?q=date:

Acknowledgements
•  Search smarter with Apache Solr, Part 1:
Essential features and the Solr schema
–  http://www.ibm.com/developerworks/java/library/j-solr1/

•  Solr Tutorial from Lucid Imagination
–  http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Solr-Tutorial

•  Solr Wiki
–  http://wiki.apache.org/solr/

Powered by Lucene
•  Wikipedia
•  Internet Archive
•  LinkedIn
•  monster.com

Indexing
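Documents (id, title):
0  Little Red Riding Hood
1  Robin Hood
2  Little Women

Inverted index (term, ids of documents containing it):
aardvark  (no documents shown)
hood      0, 1
little    0, 2
red       0
riding    0
robin     1
women     2
zoo       (no documents shown)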
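
The term-to-document mapping on this slide can be reproduced in a few lines. The following is a rough Python sketch of an inverted index over the three titles; it assumes plain whitespace tokenization and lowercasing, which is far simpler than what Lucene actually does, and the function names are made up for illustration:

```python
from collections import defaultdict

# The three titles from the slide, keyed by document id.
docs = {
    0: "Little Red Riding Hood",
    1: "Robin Hood",
    2: "Little Women",
}


def build_inverted_index(documents):
    """Map each lowercased term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


def search_all(index, *terms):
    """Return ids of documents that contain every given term (an AND query)."""
    postings = [set(index.get(t.lower(), ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(docs)
for term, ids in index.items():
    print(term, ids)
```

Looking up a term is then a dictionary access, and a multi-term query is an intersection of posting lists, which is why search stays fast as the collection grows.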

Search
•  Core parameters
•  qt – query type (request handler)
•  wt – writer type (response writer)

•  Common parameters
•  q
•  sort
•  start
•  rows
•  fq – filters
•  fl – return fields

Search Syntax
•  field:term (*:* returns everything)
•  A score is computed at query time; its absolute value has no meaning, and scores are
comparable only relative to each other within the same result set
•  fq filters the results by a supplied condition
•  wt is the response format of the results (xml, json, etc.)
•  qt is the request handler used to process the request (default is "standard")
•  fl is the list of fields to return (each field must be stored)
•  q is the query string
•  start and rows set the offset of the first result and the maximum number of results

What is Lucene
•  High performance, scalable, full-text
search library
•  Focus: Indexing + Searching Documents
–  “Document” is just a list of name+value pairs

•  No crawlers or document parsing
•  Flexible Text Analysis (tokenizers + token
filters)
•  100% Java, no dependencies, no config
files

What is Solr
•  Solr (pronounced "solar") is an open source
enterprise search platform from the Apache
Lucene project. Its major features include
full-text search, hit highlighting, faceted search,
dynamic clustering, database integration, and
rich document (e.g., Word, PDF) handling.
Providing distributed search and index
replication, Solr is highly scalable.[1] Solr is the
most popular enterprise search engine.[2] Solr 4
adds NoSQL features.[3]

Solr Features
•  Advanced Full-Text Search Capabilities
•  Optimized for High Volume Web Traffic
•  Standards Based Open Interfaces - XML, JSON and
HTTP
•  Comprehensive HTML Administration Interfaces
•  Linearly scalable, auto index replication, auto failover
and recovery
•  Near Real-time indexing
•  Flexible and Adaptable with XML configuration
•  Extensible Plugin Architecture

Indexing Data
HTTP POST to http://localhost:8983/solr/update
<add><doc>
<field name="id">05991</field>
<field name="name">Peter Parker</field>
<field name="supername">Spider-Man</field>
<field name="category">superhero</field>
<field name="powers">agility</field>
<field name="powers">spider-sense</field>
</doc></add>
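
An update message like the one above can be generated rather than typed by hand. A small Python sketch follows, assuming the same field names as the slide; build_add_xml is a hypothetical helper, and the sketch only builds the payload without sending it:

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialize one document as a Solr <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # A list value becomes one <field> element per entry (multi-valued field).
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


payload = build_add_xml({
    "id": "05991",
    "name": "Peter Parker",
    "supername": "Spider-Man",
    "category": "superhero",
    "powers": ["agility", "spider-sense"],
})
print(payload)
```

Using an XML library instead of string formatting means characters such as & or < in field values are escaped automatically.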

Data upload methods
URL=http://localhost:8983/solr/update/csv

•  HTTP POST body (curl, HttpClient, etc)
curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv

•  Multi-part file upload (browsers)
•  Request parameter
?stream.body='Cyclops, Scott Summers,…'

•  Streaming from URL (must enable)
?stream.url=file://data/info.csv

Indexing with SolrJ
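// Solr's Java Client API… remote or embedded/local!
// Imports assume the older SolrJ jars that still ship CommonsHttpSolrServer.
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("player", "Dravid");
doc.addField("name", "Kumar Rahul");
doc.addField("category", "superhero");
server.add(doc);     // send the document to Solr
server.commit();     // make it visible to searches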

Deleting Documents
•  Delete by Id, most efficient
<delete>
<id>05591</id>
<id>32552</id>
</delete>
•  Delete by Query
<delete>
<query>category:supervillain</query>
</delete>

Commit
•  <commit/> makes changes visible
–  Triggers static cache warming in
solrconfig.xml
–  Triggers autowarming from existing caches
default on

•  <optimize/> same as commit, merges all
index segments for faster searching

Lucene Index Segments
_0.fnm  _0.fdt  _0.fdx  _0.frq  _0.tis  _0.tii  _0.prx  _0.nrm  _0_1.del
_1.fnm  _1.fdt  _1.fdx  […]

Searching
http://localhost:8983/solr/select?q=powers:agility&start=0&rows=2&fl=supername,category
<response>
<result numFound="427" start="0">
<doc>
<str name="supername">Spider-Man</str>
<str name="category">superhero</str>
</doc>
<doc>
<str name="supername">Mystique</str>
<str name="category">supervillain</str>
</doc>
</result>
</response>

Response Format
•  Add &wt=json for JSON formatted response
{"response": {"numFound":427, "start":0,
"docs": [
{"supername":"Spider-Man", "category":"superhero"},
{"supername":"Magento", "category":"Purvankara"}
]
}}
•  Also Python, Ruby, PHP, SerializedPHP, XSLT

Scoring
•  Query results are sorted by score descending
•  VSM – Vector Space Model
•  tf – term frequency: number of matching terms in field
•  lengthNorm – number of tokens in field
•  idf – inverse document frequency
•  coord – coordination factor, number of matching terms
•  document boost
•  query clause boost
http://lucene.apache.org/java/docs/scoring.html

Explain
http://solr/select?q=super fast&indent=on&debugQuery=on
<lst name="debug">
<lst name="explain">
<str name="id=Flash,internal_docid=6">
0.16389132 = (MATCH) product of:
0.32778263 = (MATCH) sum of:
0.32778263 = (MATCH) weight(text:fast in 6), product of:
0.5012072 = queryWeight(text:fast), product of:
2.466337 = idf(docFreq=5)
0.20321926 = queryNorm
0.65398633 = (MATCH) fieldWeight(text:fast in 6), product of:
1.4142135 = tf(termFreq(text:fast)=2)
2.466337 = idf(docFreq=5)
0.1875 = fieldNorm(field=fast, doc=6)
0.5 = coord(1/2)
</str>
<str name="id=Superman,internal_docid=7">
0.1365761 = (MATCH) product of:
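
The nesting in the explain output is just multiplication. As a rough check, this Python sketch recomputes the first document's score from the numbers printed above; it assumes tf is the square root of the term frequency, as in Lucene's default similarity, and the variable names are illustrative only:

```python
import math

# Values copied from the explain output for text:fast in document 6.
idf = 2.466337
query_norm = 0.20321926
term_freq = 2
field_norm = 0.1875
coord = 1 / 2  # one of the two query terms ("super", "fast") matched

tf = math.sqrt(term_freq)              # shown as 1.4142135
query_weight = idf * query_norm        # shown as 0.5012072
field_weight = tf * idf * field_norm   # shown as 0.65398633
score = coord * (query_weight * field_weight)

print(round(score, 6))
```

The result agrees with the 0.16389132 reported for id=Flash up to float rounding, which makes it easier to see which factor to tune when a document ranks unexpectedly.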

Lucene Query Syntax
1.  justice league
•  Equiv: justice OR league
•  QueryParser default operator is "OR"/optional
2.  +justice +league -name:aquaman
•  Equiv: justice AND league NOT name:aquaman
3.  "justice league" -name:aquaman
4.  title:spiderman^10 description:spiderman
5.  description:"spiderman movie"~100

Lucene Query Examples2
1.  releaseDate:[2000 TO 2007]
2.  Wildcard searches: sup?r, su*r, super*
3.  spider~
•  Fuzzy search: Levenshtein distance
•  Optional minimum similarity: spider~0.7
4.  *:*
5.  (Superman AND "Lex Luthor") OR
(+Batman +Joker)
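
Fuzzy search is based on Levenshtein (edit) distance, which may be unfamiliar. The sketch below is a textbook dynamic-programming version in Python, included only to illustrate the idea; it is not Lucene's implementation, and how Lucene turns a distance into the similarity value after the tilde is not covered here:

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep when equal
            ))
        prev = curr
    return prev[-1]


print(levenshtein("spider", "spyder"))
```

A query such as spider~ matches terms within a small edit distance of "spider", so a misspelling like "spyder" can still be found.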

DisMax Query Syntax
•  Good for handling raw user queries
–  Balanced quotes for phrase query
–  '+' for required, '-' for prohibited
–  Separates query terms from query structure

http://solr/select?qt=dismax
&q=super man                 // the user query
&qf=title^3 subject^2 body   // fields to query
&pf=title^2,body             // fields to do phrase queries
&ps=100                      // slop for those phrase q's
&tie=.1                      // multi-field match reward
&mm=2                        // # of terms that should match
&bf=popularity               // boost function

DisMax Query Form
•  The expanded Lucene Query:

+( DisjunctionMaxQuery( title:super^3 |
subject:super^2 | body:super)
DisjunctionMaxQuery( title:man^3 |
subject:man^2 | body:man)
)
DisjunctionMaxQuery(title:"super man"~100^2
body:"super man"~100)
FunctionQuery(popularity)
•  Tip: set up your own request handler with default parameters
to avoid clients having to specify them
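
To see what the tie parameter does inside each DisjunctionMaxQuery, here is a small Python sketch of how a disjunction-max clause combines per-field scores: the best field wins, and the other matching fields contribute only a fraction set by tie. The per-field scores are invented numbers, and dismax_score is a hypothetical helper, not a Solr API:

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: best score plus tie times the sum of the rest."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Invented scores for one term matching in title, subject and body.
scores = [0.9, 0.4, 0.1]

pure_max = dismax_score(scores, tie=0.0)  # only the best field counts
with_tie = dismax_score(scores, tie=0.1)  # the tie=.1 used on the slide
pure_sum = dismax_score(scores, tie=1.0)  # behaves like a plain sum
print(pure_max, with_tie, pure_sum)
```

With tie=.1 a document matching in several fields ranks slightly above one matching in a single field, without letting repeated matches dominate the score.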

Function Query
•  Allows adding function of field value to score
–  Boost recently added or popular documents

•  Current parser only supports function
notation
•  Example: log(sum(popularity,1))
•  sum, product, div, log, sqrt, abs, pow
•  scale(x, target_min, target_max)
–  calculates min & max of x across all docs

•  map(x, min, max, target)
–  useful for dealing with defaults
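
The functions listed here are easier to follow with concrete numbers. This Python sketch imitates log, map and scale outside Solr; it assumes that Solr's log is base 10 and that map replaces values inside [min, max] with the target, and the fq_ function names and popularity values are made up for illustration:

```python
import math


def fq_log(x):
    """Base-10 logarithm, mirroring the log function query."""
    return math.log10(x)


def fq_map(x, lo, hi, target):
    """Return target when lo <= x <= hi, otherwise x unchanged."""
    return target if lo <= x <= hi else x


def fq_scale(values, target_min, target_max):
    """Linearly rescale values so their min and max hit the target range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: this sketch maps everything to target_min.
        return [float(target_min) for _ in values]
    span = (target_max - target_min) / (hi - lo)
    return [target_min + (v - lo) * span for v in values]


popularity = [0, 9, 99]
boosts = [fq_log(p + 1) for p in popularity]  # log(sum(popularity,1))
defaulted = [fq_map(p, 0, 0, 1) for p in popularity]
scaled = fq_scale([0, 5, 10], 1, 2)
print(boosts, defaulted, scaled)
```

Adding 1 before taking the logarithm keeps documents with popularity 0 from producing an undefined value, which is the reason for the sum in the slide's example.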

Boosted Query
•  Score is multiplied instead of added
–  New local params {!...} syntax added

&q={!boost b=sqrt(popularity)}super man
•  Parameter dereferencing in local params
&q={!boost b=$boost v=$userq}
&boost=sqrt(popularity)
&userq=super man

Configuring Relevancy

<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory"
words="stopwords.txt"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
</analyzer>
</fieldType>
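
The analyzer above is a pipeline: one tokenizer followed by filters applied in order. The Python sketch below mimics the first four stages on a single string to show the effect of that ordering; the stopword and synonym entries are invented stand-ins for stopwords.txt and synonyms.txt, the synonym handling is a simple one-to-one replacement, and the Porter stemming step is left out:

```python
# Invented stand-ins for the contents of stopwords.txt and synonyms.txt.
STOPWORDS = {"the", "a", "of"}
SYNONYMS = {"tv": "television"}


def analyze(text):
    """Apply a simplified version of the analyzer chain to one string."""
    tokens = text.split()                              # whitespace tokenizer
    tokens = [t.lower() for t in tokens]               # lowercase filter
    tokens = [SYNONYMS.get(t, t) for t in tokens]      # synonym filter (1:1 only)
    tokens = [t for t in tokens if t not in STOPWORDS] # stop filter
    return tokens


tokens = analyze("The Return of the TV")
print(tokens)
```

Because lowercasing runs before the synonym and stopword steps, entries in those files only need to be listed in lowercase.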

Field Definitions
•  Field Attributes: name, type, indexed, stored,
multiValued, omitNorms, termVectors
<field name="id" type="string" indexed="true" stored="true"/>
<field name="sku" type="textTight" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="false"/>
<field name="price" type="sfloat" indexed="true" stored="false"/>
<field name="category" type="text_ws" indexed="true" stored="true"
multiValued="true"/>

•  Dynamic Fields
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>

copyField
•  Copies one field to another at index time
•  Use case #1: Analyze same field different ways
–  copy into a field with a different analyzer
–  boost exact-case, exact-punctuation matches
–  language translations, thesaurus, soundex

<field name="title" type="text"/>
<field name="title_exact" type="text_exact" stored="false"/>
<copyField source="title" dest="title_exact"/>
•  Use case #2: Index multiple fields into single
searchable field

Facet Query

http://solr/select?q=foo&wt=json&indent=on
&facet=true&facet.field=cat
&facet.query=price:[0 TO 100]
&facet.query=manu:IBM
{"response":{"numFound":26,"start":0,"docs":[…]},
"facet_counts":{
"facet_queries":{
"price:[0 TO 100]":6,
"manu:IBM":2},
"facet_fields":{
"cat":[ "electronics",14, "memory",3,
"card",2, "connector",2]
}}}
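
Note that facet_fields comes back as a flat list alternating value and count. The following Python sketch, using a trimmed copy of the facet_counts part of the response above, pairs them up; facet_pairs is a hypothetical helper name:

```python
import json

# Trimmed copy of the facet_counts section shown in the response above.
raw = """
{"facet_counts": {
  "facet_queries": {"price:[0 TO 100]": 6, "manu:IBM": 2},
  "facet_fields": {"cat": ["electronics", 14, "memory", 3,
                           "card", 2, "connector", 2]}}}
"""


def facet_pairs(flat):
    """Turn [value, count, value, count, ...] into [(value, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


facets = json.loads(raw)["facet_counts"]
cat_counts = facet_pairs(facets["facet_fields"]["cat"])
for value, count in cat_counts:
    print(value, count)
```

Keeping the result as a list of pairs rather than a dictionary preserves the order in which Solr returned the facet values.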

Filters
•  Filters are restrictions in addition to the query
•  Use in faceting to narrow the results
•  Filters are cached separately for speed
1. User queries for memory, query sent to Solr is
&q=memory&fq=inStock:true&facet=true&…
2. User selects 1GB memory size
&q=memory&fq=inStock:true&fq=size:1GB&…
3. User selects DDR2 memory type
&q=memory&fq=inStock:true&fq=size:1GB&fq=type:DDR2&…
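
Each drill-down step simply appends one more fq parameter to the same query. A short Python sketch of that accumulation follows, using the filter values from the three steps above; drilldown_params is a hypothetical helper, and only the parameters visible on the slide are produced:

```python
from urllib.parse import urlencode


def drilldown_params(query, filters):
    """Encode q plus one fq parameter per selected filter, with faceting on."""
    params = [("q", query)]
    params += [("fq", f) for f in filters]  # repeated fq keys are allowed
    params.append(("facet", "true"))
    return urlencode(params)


selected = ["inStock:true"]
step1 = drilldown_params("memory", selected)

selected.append("size:1GB")   # user picks the 1GB facet value
step2 = drilldown_params("memory", selected)

selected.append("type:DDR2")  # user picks the DDR2 facet value
step3 = drilldown_params("memory", selected)
print(step3)
```

A list of pairs is used instead of a dictionary because the fq key has to appear several times in the same request.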

Highlighting
http://solr/select?q=lcd&wt=json&indent=on
&hl=true&hl.fl=features
{"response":{"numFound":5,"start":0,"docs":[
{"id":"3007WFP", "price":899.95}, …]},
"highlighting":{
"3007WFP":{ "features":["30\" TFT active matrix
<em>LCD</em>, 2560 x 1600"]},
"VA902B":{ "features":["19\" TFT active matrix
<em>LCD</em>, 8ms response time, 1280 x
1024 native resolution"]}}}

MoreLikeThis
•  Selects documents that are “similar” to the
documents matching the main query.
&q=id:6H500F0
&mlt=true&mlt.fl=name,cat,features
"moreLikeThis":{ "6H500F0":{"numFound":5,"start":0,
"docs": [
{"name":"Apple 60 GB iPod with Video
Playback Black", "price":399.0,
"inStock":true, "popularity":10, […]
}, […]
]
[…]

High Availability

[Architecture diagram. Components: Appservers (dynamic HTML generation), Load Balancer, Solr Searchers, Solr Master, Updater, DB, admin terminal. Labeled flows: HTTP search requests, index replication, admin queries, updates.]

Resources
•  WWW
–  http://lucene.apache.org/solr
–  http://lucene.apache.org/solr/tutorial.html
–  http://wiki.apache.org/solr/

•  Mailing Lists
–  solr-user-subscribe@lucene.apache.org
–  solr-dev-subscribe@lucene.apache.org
