These slides belonged to a presentation I gave to my colleagues in Göttingen as an introduction to the Apache Solr open source search engine. In structure I followed Trey Grainger and Timothy Potter's excellent Solr in Action book (Manning, 2014), and I took some of the examples from there. Others come from the examples bundled with Solr, and from projects I had the opportunity to work on in the past (eXtensible Catalog and Europeana).
These slides don't go too deep; if you want to know more about the topic, just drop me an email, or consult the references on the last slide.
Happy searching!
2. What is Apache Solr?
Solr is the popular, blazing-fast, open source enterprise
search platform built on Apache Lucene
3. ● 1999: Doug Cutting published Lucene
● 2004: Yonik Seeley published Solr
● 2006: Apache project (2007: TLP)
● 2009: LucidWorks company
● 2010: Merge of Lucene and Solr
● 2011: 3.1
● 2012: 4.0
● 2015: 5.0
History in one minute
4. “Sister” projects
● Nutch: web scale search engine
● Tika: document parser
● Hadoop: distributed storage and data processing
● Elasticsearch: alternative to Solr
● forks/ports of Lucene
● client libraries and tools (Luke index viewer)
5. Main features I
● Faceted navigation
● Hit highlighting
● Query language
● Schema-less mode and Schema REST API
● JSON, XML, PHP, Ruby, Python, XSLT,
Velocity and custom Java binary outputs
● HTML administration interface
6. Main features II
● Replication to other Solr servers
● Distributed search through sharding
● Search results clustering based on Carrot2
● Extensible through plugins
● Relevance boosting via functions
● Caching - queries, filters, and documents
● Embeddable in a Java Application
7. Main features III
● Geo-spatial search, including multiple points per document and polygons
● Automated management of large clusters
through ZooKeeper
● Function queries
● Field Collapsing and grouping
● Auto-suggest
9. Inverted index
Index structure
Term Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 Doc7
a 0 1 1 1 0 0 0
becomming 0 0 0 0 1 0 0
beginner’s 0 0 0 0 0 1 0
buy 0 0 1 0 0 0 0
The term column is stored as a reference to a tree structure; each term's document occurrences are stored as a bit vector.
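The bit-vector idea above can be sketched in a few lines of Java (a toy illustration, not how Lucene actually stores its index):

```java
import java.util.*;

// Toy inverted index: each term maps to a bit vector whose n-th bit is set
// when document n contains the term (cf. the table above).
public class InvertedIndex {

    static Map<String, BitSet> build(List<String> docs) {
        Map<String, BitSet> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // naive "analysis": lowercasing + whitespace tokenization
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, t -> new BitSet()).set(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, BitSet> index =
                build(List.of("Buy a book", "a beginner's guide", "becomming"));
        System.out.println(index.get("a"));   // prints {0, 1}
    }
}
```

Real indexes keep the term dictionary in a tree/FST and compress the postings, but the lookup idea is the same.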
10. Indexing
Document ~ RDBMS record
Fields (key-value structure):
● types (text, numeric, date, point, custom)
● indexed, stored, multiple, required
● field name patterns (prefixes, suffixes, such
as *_tx)
● special fields (identifier, _version_)
11. Indexing
formats: JSON, XML, binary, RDBMS, ...
connections: file, Data Import Handler, API
sharding (splitting the document set across multiple partitions)
denormalized documents - (almost) no JOIN ;-(
copy field
catch all field (contains everything)
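In schema.xml the field name pattern and the catch-all copy field look roughly like this (a sketch; the field and type names are illustrative, not from the slides):

```xml
<!-- any field ending in _tx is indexed as analyzed text -->
<dynamicField name="*_tx" type="text_general" indexed="true" stored="true"/>

<!-- catch-all field: every field is copied into one searchable field -->
<field name="text" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="*" dest="text"/>
```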
12. A document example (XML)
<doc>
<field name="id">F8V7067-APL-KIT</field> string
<field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field> text
<field name="cat">electronics</field>
<field name="cat">connector</field> multivalue
<field name="price">19.95</field> float
<field name="inStock">false</field> boolean
<field name="store">45.18014,-93.87741</field> geo point
<field name="manufacturedate_dt">2005-08-01T16:30:25Z</field> date
</doc>
13. A document example (JSON)
{
"id": "F8V7067-APL-KIT",
"name": "Belkin Mobile Power Cord for iPod w/ Dock",
"cat": ["electronics", "connector"],
"price":19.95,
"inStock":false,
"store": "45.18014,-93.87741",
"manufacturedate_dt": "2005-08-01T16:30:25Z"
}
14. A document example (SolrJ library)
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

SolrServer solr = new HttpSolrServer("http://…");
SolrInputDocument doc = new SolrInputDocument();
doc.setField("id", "F8V7067-APL-KIT");
doc.setField("name", "Belkin Mobile Power Cord for iPod w/ Dock");
...
solr.add(doc);
solr.commit(true, true);
15. Text analysis chain
1) character filters — preprocess text
pattern replace, ASCII folding, HTML stripping
2) tokenizers — split text into smaller units
whitespace, lowercase, word delim., standard
3) token filters — examine/modify/eliminate tokens
stemming, lowercase, stop words
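In schema.xml the three stages are wired together inside a fieldType; an illustrative sketch (the type name and file name are assumptions):

```xml
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>   <!-- 1) character filter -->
    <tokenizer class="solr.StandardTokenizerFactory"/>      <!-- 2) tokenizer -->
    <filter class="solr.LowerCaseFilterFactory"/>           <!-- 3) token filters -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```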
17. Text analysis result
#Yummm :) Drinking a latte at Caffé Grecco in
SF’s historic North Beach…Learning text
analysis
“#yumm”, “drink”, “latte”, “caffe”, “grecco”,
“sf”/”san francisco”, “historic” “north” “beach”
“learn”, “text”, “analysis”
18. Performing queries
1) user enters a query (+ specifies other
components)
2) query handler
3) analysis (similar to the indexing-time chain)
4) run search
5) other components add their output
6) serialization (XML, JSON etc.)
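A request that walks through these steps might look like this (field names taken from the earlier document example):

```
/select?q=name:"power cord"&fq=inStock:true&sort=price asc&wt=json
```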
20. Lucene query language
● name:Max AND name:Planck
● name:Max OR name:Planck
● name:Max NOT name:Planck
● name:”Max Planck”
● name:(“Max Planck” OR Gesellschaft)
● “Max Planck”~3 (within 3 words)
→ so “Planck Max”, “Max Ludwig Planck”
21. Lucene query language
● max planck^10 (weighting)
● price:[10 TO 20] (→ 10..20)
● price:{10 TO 20} (→ 11..19)
● born:[1900-01-01T00:00:00Z TO 1949-12-31T23:59:59Z] (date range)
22. Date mathematics
indexing with hour granularity:
"born": "2012-05-22T09:30:22Z/HOUR"
search by a relative time range, e.g. last month:
born:[NOW/DAY-1MONTH TO NOW/DAY]
keywords:
MINUTE, HOUR, DAY, WEEK, MONTH, YEAR
23. Faceted search
Facets let users get an overview of the
content and help them browse without entering
search terms (search theorists consider browsing
and searching equally important).
● term/field facet: list terms and counts
● query facet: run queries, return counts
● range facet: split range into pieces
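A term/field facet request and an abbreviated response for the cat field of the earlier document examples (the counts are made up):

```
/select?q=*:*&rows=0&facet=true&facet.field=cat

"facet_counts": {
  "facet_fields": {
    "cat": ["electronics", 14, "connector", 2]
  }
}
```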
25. Term facet
Additional parameters:
● limit, offset → for pagination
● sort (by index or count) → alphabetically or frequency
● mincount → filter less frequent terms
● missing → number of documents that miss this field
● prefix → such as “http” to display URLs only
● f.[facet name].facet.[parameter] → overrides the general settings for one facet
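For example, a hypothetical request that limits and sorts the cat facet alphabetically while other facets keep the general settings:

```
&facet=true&facet.field=cat&facet.field=store
&f.cat.facet.limit=10&f.cat.facet.sort=index
```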
26. Query facets
&facet=true&
facet.query=price:[* TO 5}&
facet.query=price:[5 TO 10}&
facet.query=price:[10 TO 20}&
facet.query=price:[20 TO 50}&
facet.query=price:[50 TO *]
"facet_counts":{
"facet_queries":{
"price:[* TO 5}":6,
"price:[5 TO 10}":5,
"price:[10 TO 20}":3,
"price:[20 TO 50}":6,
"price:[50 TO *]":0
},
27. Query facets (zooming)
From centuries to years
http://pcu.bage.es/ Catálogo Colectivo de las Bibliotecas de la Administración General del Estado
30. More like this… (similar documents)
mlt (more like this) handler:
● doc ID
● fields
● boost
● limit
● min length and freq
http://catalog.lib.kyushu-u.ac.jp/en/ - Kyushu University library catalog
31. More like this (alternative solution)
(DATA_PROVIDER:("NIOD")^0.2 OR what:("IMAGE" OR "Amerikaanse
Strijdkrachten" OR "Luchtmacht" OR "Steden - Zie ook: Ruimtelijke ordening,
Wederopbouw, Dorpen")^0.8) NOT europeana_id:"/2021622/11607
33. Multilingual search strategies
● Separate fields by language
→ title_en:horse OR title_de:horse OR title_hu:horse
● Separate collections (core, shard) per language
each core has its own language settings but the same field names
→ /select?shards=.../english,.../spanish,.../french
&q=title:horse
● All languages in one field (from Solr 5.0)
→ title:(es|escuela OR en,es,de|school OR school)
35. Relevancy
The most important concepts:
● Term frequency (tf) - how often a particular term appears in a matching
document
● Inverse document frequency (idf) - how “rare” a search term is, inverse
of the document frequency (how many total documents the search term
appears within)
● Field normalization factor (field norm) - a combination of factors
describing the importance of a particular field on a per-document basis
36. Relevancy
score(q,d) = Σ (tf(t in d) × idf(t)² × t.getBoost() ×
norm(t,d)) × coord(q,d) × queryNorm(q)
where
t = term; d = document; q = query; f = field
tf(t in d) = √(num. of term occurrences in document)
norm(t,d) = d.getBoost() × lengthNorm(f) × f.getBoost()
idf(t) = 1 + log(numDocs / (docFreq + 1))
coord(q,d) = numTermsInDocumentFromQuery / numTermsInQuery
queryNorm(q) = 1 / √(sumOfSquaredWeights)
sumOfSquaredWeights = q.getBoost()² × Σ(idf(t) × t.getBoost())²
see: Solr in Action, p. 67
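The tf and idf pieces of the formula are easy to evaluate by hand; a small sketch with invented numbers (100 documents in the index, a term appearing in 10 of them and 4 times in the matching document):

```java
// Sketch of the tf and idf components of the classic Lucene/Solr
// TF-IDF formula above; the numbers are made up for illustration.
public class TfIdf {

    static double tf(int termOccurrencesInDoc) {
        return Math.sqrt(termOccurrencesInDoc);          // tf(t in d)
    }

    static double idf(int numDocs, int docFreq) {
        return 1 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        System.out.println(tf(4));        // 2.0
        System.out.println(idf(100, 10)); // ≈ 3.207 — rarer terms score higher
    }
}
```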