These slides belonged to a presentation I gave to my colleagues in Göttingen as an introduction to the Apache Solr open source search engine. In structure I followed Trey Grainger and Timothy Potter's excellent Solr in Action book (Manning, 2014), and I took some of the examples from there. Others come from the examples bundled with Solr, and from projects I had the opportunity to work on in the past (eXtensible Catalog and Europeana).
These slides don't go too deep; if you want to know more about the topic, just drop me an email, or consult the references on the last slide.
Happy searching!
2. What is Apache Solr?
Solr is the popular, blazing-fast, open source enterprise
search platform built on Apache Lucene
3. ● 1999: Doug Cutting published Lucene
● 2004: Yonik Seeley published Solr
● 2006: Apache project (2007: TLP)
● 2009: LucidWorks company
● 2010: Merge of Lucene and Solr
● 2011: 3.1
● 2012: 4.0
● 2015: 5.0
History in one minute
4. “Sister” projects
● Nutch: web scale search engine
● Tika: document parser
● Hadoop: distributed storage and data processing
● Elasticsearch: alternative to Solr
● forks/ports of Lucene
● client libraries and tools (Luke index viewer)
5. Main features I
● Faceted navigation
● Hit highlighting
● Query language
● Schema-less mode and Schema REST API
● JSON, XML, PHP, Ruby, Python, XSLT,
Velocity and custom Java binary outputs
● HTML administration interface
6. Main features II
● Replication to other Solr servers
● Distributed search through sharding
● Search results clustering based on Carrot2
● Extensible through plugins
● Relevance boosting via functions
● Caching - queries, filters, and documents
● Embeddable in a Java Application
7. Main features III
● Geo-spatial search, including multiple points per document and polygons
● Automated management of large clusters
through ZooKeeper
● Function queries
● Field Collapsing and grouping
● Auto-suggest
9. Inverted index
Index structure
Term Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 Doc7
a 0 1 1 1 0 0 0
becomming 0 0 0 0 1 0 0
beginner’s 0 0 0 0 0 1 0
buy 0 0 1 0 0 0 0
The term column is stored as a reference to a tree structure; each term's document occurrences are stored as a bit vector.
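The bit-vector idea above can be sketched in a few lines of Java (a toy illustration, not how Lucene actually stores its index):

```java
import java.util.*;

// Toy inverted index: each term maps to a bit vector whose n-th bit is set
// when document n contains the term (cf. the table above).
public class InvertedIndex {

    static Map<String, BitSet> build(List<String> docs) {
        Map<String, BitSet> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // naive "analysis": lowercasing + whitespace tokenization
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, t -> new BitSet()).set(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, BitSet> index =
                build(List.of("Buy a book", "a beginner's guide", "becomming"));
        System.out.println(index.get("a"));   // prints {0, 1}
    }
}
```

Real indexes keep the term dictionary in a tree/FST and compress the postings, but the lookup idea is the same.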
10. Indexing
Document ~ RDBMS record
Fields (key-value structure):
● types (text, numeric, date, point, custom)
● indexed, stored, multiple, required
● field name patterns (prefixes, suffixes, such
as *_tx)
● special fields (identifier, _version_)
11. Indexing
formats: JSON, XML, binary, RDBMS, ...
connections: file, Data Import Handler, API
sharding (splitting the document set across multiple partitions)
denormalized documents - (almost) no JOIN ;-(
copy field
catch all field (contains everything)
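In schema.xml the field name pattern and the catch-all copy field look roughly like this (a sketch; the field and type names are illustrative, not from the slides):

```xml
<!-- any field ending in _tx is indexed as analyzed text -->
<dynamicField name="*_tx" type="text_general" indexed="true" stored="true"/>

<!-- catch-all field: every field is copied into one searchable field -->
<field name="text" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="*" dest="text"/>
```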
12. A document example (XML)
<doc>
<field name="id">F8V7067-APL-KIT</field> string
<field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field> text
<field name="cat">electronics</field>
<field name="cat">connector</field> multivalue
<field name="price">19.95</field> float
<field name="inStock">false</field> boolean
<field name="store">45.18014,-93.87741</field> geo point
<field name="manufacturedate_dt">2005-08-01T16:30:25Z</field> date
</doc>
13. A document example (JSON)
{
"id": "F8V7067-APL-KIT",
"name": "Belkin Mobile Power Cord for iPod w/ Dock",
"cat": ["electronics", "connector"],
"price":19.95,
"inStock":false,
"store": "45.18014,-93.87741",
"manufacturedate_dt": "2005-08-01T16:30:25Z"
}
14. A document example (SolrJ library)
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

SolrServer solr = new HttpSolrServer("http://…");
SolrInputDocument doc = new SolrInputDocument();
doc.setField("id", "F8V7067-APL-KIT");
doc.setField("name", "Belkin Mobile Power Cord for iPod w/ Dock");
...
solr.add(doc);
solr.commit(true, true);
15. Text analysis chain
1) character filters — preprocess text
pattern replace, ASCII folding, HTML stripping
2) tokenizers — split text into smaller units
whitespace, lowercase, word delim., standard
3) token filters — examine/modify/eliminate tokens
stemming, lowercase, stop words
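In schema.xml the three stages are wired together inside a fieldType; an illustrative sketch (the type name and file name are assumptions):

```xml
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>   <!-- 1) character filter -->
    <tokenizer class="solr.StandardTokenizerFactory"/>      <!-- 2) tokenizer -->
    <filter class="solr.LowerCaseFilterFactory"/>           <!-- 3) token filters -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```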
17. Text analysis result
#Yummm :) Drinking a latte at Caffé Grecco in
SF’s historic North Beach…Learning text
analysis
“#yumm”, “drink”, “latte”, “caffe”, “grecco”,
“sf”/”san francisco”, “historic” “north” “beach”
“learn”, “text”, “analysis”
18. Performing queries
1) user enters a query (+ specifies other
components)
2) query handler
3) analysis (similar to the indexing-time chain)
4) run search
5) other components add their output
6) serialization (XML, JSON etc.)
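A request that walks through these steps might look like this (field names taken from the earlier document example):

```
/select?q=name:"power cord"&fq=inStock:true&sort=price asc&wt=json
```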
20. Lucene query language
● name:Max AND name:Planck
● name:Max OR name:Planck
● name:Max NOT name:Planck
● name:”Max Planck”
● name:(“Max Planck” OR Gesellschaft)
● “Max Planck”~3 (within 3 words)
→ so “Planck Max”, “Max Ludwig Planck”
21. Lucene query language
● max planck^10 (weighting)
● price:[10 TO 20] (→ 10..20)
● price:{10 TO 20} (→ 11..19)
● born:[1900-01-01T00:00:00Z TO 1949-12-31T23:59:59Z] (date range)
22. Date mathematics
indexing with hour granularity:
"born": "2012-05-22T09:30:22Z/HOUR"
search by a relative time range, e.g. last month:
born:[NOW/DAY-1MONTH TO NOW/DAY]
keywords:
MINUTE, HOUR, DAY, WEEK, MONTH, YEAR
23. Faceted search
Facets let users get an overview of the
content and help them browse without entering
search terms (search theorists consider browsing
and searching equally important).
● term/field facet: list terms and counts
● query facet: run queries, return counts
● range facet: split range into pieces
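A term/field facet request and an abbreviated response for the cat field of the earlier document examples (the counts are made up):

```
/select?q=*:*&rows=0&facet=true&facet.field=cat

"facet_counts": {
  "facet_fields": {
    "cat": ["electronics", 14, "connector", 2]
  }
}
```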
25. Term facet
Additional parameters:
● limit, offset → for pagination
● sort (by index or count) → alphabetically or frequency
● mincount → filter less frequent terms
● missing → number of documents that miss this field
● prefix → such as “http” to display URLs only
● f.[facet name].facet.[parameter] → overrides the general settings for one facet
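For example, a hypothetical request that limits and sorts the cat facet alphabetically while other facets keep the general settings:

```
&facet=true&facet.field=cat&facet.field=store
&f.cat.facet.limit=10&f.cat.facet.sort=index
```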
26. Query facets
&facet=true&
facet.query=price:[* TO 5}&
facet.query=price:[5 TO 10}&
facet.query=price:[10 TO 20}&
facet.query=price:[20 TO 50}&
facet.query=price:[50 TO *]
"facet_counts":{
"facet_queries":{
"price:[* TO 5}":6,
"price:[5 TO 10}":5,
"price:[10 TO 20}":3,
"price:[20 TO 50}":6,
"price:[50 TO *]":0
},
27. Query facets (zooming)
From centuries to years
http://pcu.bage.es/ Catálogo Colectivo de las Bibliotecas de la Administración General del Estado
30. More like this… (similar documents)
mlt (more like this) handler:
● doc ID
● fields
● boost
● limit
● min length and freq
http://catalog.lib.kyushu-u.ac.jp/en/ - Kyushu University library catalog
31. More like this (alternative solution)
(DATA_PROVIDER:("NIOD")^0.2 OR what:("IMAGE" OR "Amerikaanse
Strijdkrachten" OR "Luchtmacht" OR "Steden - Zie ook: Ruimtelijke ordening,
Wederopbouw, Dorpen")^0.8) NOT europeana_id:"/2021622/11607
33. Multilingual search strategies
● Separate fields by language
→ title_en:horse OR title_de:horse OR title_hu:horse
● Separate collections (core, shard) per language
each core has its own language settings but the same field names
→ /select?shards=.../english,.../spanish,.../french
&q=title:horse
● All languages in one field (from Solr 5.0)
→ title:(es|escuela OR en,es,de|school OR school)
35. Relevancy
The most important concepts:
● Term frequency (tf) - how often a particular term appears in a matching
document
● Inverse document frequency (idf) - how “rare” a search term is, inverse
of the document frequency (how many total documents the search term
appears within)
● Field normalization factor (field norm) - a combination of factors
describing the importance of a particular field on a per-document basis
36. Relevancy
score(q,d) = Σ (tf(t in d) × idf(t)² × t.getBoost() ×
norm(t,d)) × coord(q,d) × queryNorm(q)
where
t = term; d = document; q = query; f = field
tf(t in d) = √(num. of term occurrences in document)
norm(t,d) = d.getBoost() × lengthNorm(f) × f.getBoost()
idf(t) = 1 + log(numDocs / (docFreq + 1))
coord(q,d) = numTermsInDocumentFromQuery / numTermsInQuery
queryNorm(q) = 1 / √(sumOfSquaredWeights)
sumOfSquaredWeights = q.getBoost()² × Σ(idf(t) × t.getBoost())²
see: Solr in Action, p. 67
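The tf and idf pieces of the formula are easy to evaluate by hand; a small sketch with invented numbers (100 documents in the index, a term appearing in 10 of them and 4 times in the matching document):

```java
// Sketch of the tf and idf components of the classic Lucene/Solr
// TF-IDF formula above; the numbers are made up for illustration.
public class TfIdf {

    static double tf(int termOccurrencesInDoc) {
        return Math.sqrt(termOccurrencesInDoc);          // tf(t in d)
    }

    static double idf(int numDocs, int docFreq) {
        return 1 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        System.out.println(tf(4));        // 2.0
        System.out.println(idf(100, 10)); // ≈ 3.207 — rarer terms score higher
    }
}
```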