2. Agenda
Introductions
Enterprise search
What is Solr, why choose it?
Solr Terminology
Main Solr Features
How it works
Anatomy of a Query
Scalability
Case studies
Sparebank1
Komplett Group
3. Enterprise Search
Search has become mission critical for most enterprises
Intranet
Web presence
E-commerce
Exponential growth of data
Cost of not finding information
Knowledge (sharing)
Time
Money
Information blackhole
4. What is Solr?
Official definition:
“Solr is an open source enterprise search platform based on the Lucene Java search library, with an HTTP interface using XML, JSON or other formats.
It provides hit highlighting, faceted search, caching, replication, a web administration interface and many more features.
It runs in a Java servlet container such as Apache Tomcat.”
http://lucene.apache.org/solr
5. What is Solr?
Open-source search engine with no licence fees
Uses Apache Lucene library and adds enterprise search server
features and capabilities
Web-based application that processes requests and returns responses via HTTP
Easy scalability and great performance
Industry-tested worldwide
Modern solution architecture based on XML and Java – easy to work with
Well integrated with the ecosystem around Big Data, such as
Hadoop (also Nutch, Tika).
6. Why choose Solr?
“Buy” > Build
Open source vs. Commercial solution
Open source software is free
Licensed software can be very expensive
High quality and easily modifiable relevancy
Very fast query and indexing performance
Highly flexible data processing/transformation
7. Why choose Solr?
Some challenges unique to open source…
No guaranteed support or bug fixing from community
No formal quality control or support for upgrades
Limited support for less experienced developers
Some benefits unique to open source…
Widely used and tested
Access to source code
Access to development versions and unreleased patches
Ultimately search is a specialised field and requires specialists
9. Main Solr Features
Full text search
Field search
Number and date searching
Facets
Spelling assistance – “Did you mean…?”
Replication
Master/Slave architecture
Related hits
Query completion
Admin GUI
10. How it works
Easy configuration through XML
schema.xml
solrconfig.xml
Documents are POSTed via HTTP to Solr
Add/update
Delete
Commit
Queries and responses are also sent via HTTP
Choice of formats
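To make the add/delete/commit messages concrete, here is a minimal Python sketch that builds the XML bodies Solr's update handler accepts. The field names (`id`, `title`, `category`) and the sample document are hypothetical, and the sketch only builds the message strings; the URL they would be POSTed to depends on your installation.

```python
import xml.etree.ElementTree as ET


def build_add(doc):
    """Build an <add> message for one document (dict of field name -> value or list of values)."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # A multi-valued field is written as one <field> element per value.
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


def build_delete(doc_id):
    """Build a <delete> message that removes one document by its unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


# Changes only become visible to searches after a commit message.
COMMIT = "<commit/>"

# Hypothetical sample document, for illustration only.
add_xml = build_add({"id": "1", "title": "Intro to Solr", "category": ["search", "talks"]})
delete_xml = build_delete("1")
print(add_xml)
print(delete_xml)
print(COMMIT)
```

Each string would be sent as the body of its own HTTP POST, with a commit at the end of a batch rather than after every document.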
11. Anatomy of a Query
Common parameters
start, rows, fl, fq, sort
http://wiki.apache.org/solr/CommonQueryParameters
?q=*:*&start=0&rows=10&fl=title&fq=collection:popular&sort=title asc
Slightly more advanced
&facet
&qf
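The example query on this slide can also be assembled programmatically, which takes care of URL encoding (for instance the space in the sort clause). A minimal Python sketch, assuming a `/select` handler path (not stated on the slide) on the `localhost:8080` host used elsewhere in this deck; nothing is sent over the network.

```python
from urllib.parse import urlencode, parse_qs

# Common query parameters from the slide.
params = {
    "q": "*:*",                  # match all documents
    "start": 0,                  # offset of the first hit to return
    "rows": 10,                  # number of hits per page
    "fl": "title",               # fields to return
    "fq": "collection:popular",  # filter query, restricts the search space
    "sort": "title asc",         # sort field and direction
}

query_string = urlencode(params)

# Assumed base URL; adjust host, port and handler path to your own setup.
url = "http://localhost:8080/solr/select?" + query_string
print(url)

# Decoding the query string again recovers the original values.
decoded = parse_qs(query_string)
```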
12. What is Faceting?
Navigation/discovery technique
Tally of docs for each distinct field value
Parameters
&facet=true
&facet.field=category
And so much more…
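The "tally of docs for each distinct field value" idea can be illustrated in a few lines of Python. This is only a conceptual sketch over a hypothetical in-memory list of documents; it is not how Solr computes facets internally.

```python
from collections import Counter

# Hypothetical documents; "category" may be single- or multi-valued.
docs = [
    {"id": 1, "category": "laptop"},
    {"id": 2, "category": "phone"},
    {"id": 3, "category": "laptop"},
    {"id": 4, "category": "phone"},
    {"id": 5, "category": ["laptop", "tablet"]},
]


def facet_counts(docs, field):
    """Count how many documents carry each distinct value of `field`."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        # set() ensures a document is counted once per distinct value.
        counts.update(set(values))
    # Highest counts first, as a facet list is usually displayed.
    return counts.most_common()


result = facet_counts(docs, "category")
print(result)
```

In Solr the same tally is requested with the `facet=true` and `facet.field=category` parameters shown on the slide, and the counts reflect only the documents matching the current query.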
13. Scalability
Architecture goals:
More queries per second (qps)
Faster query execution
Bigger indexes
Faster indexing
Scaling options
Multicore
Replication
Sharding
14. Scalability - Multicore
Running more than one Solr core in a single Solr webapp
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>
http://localhost:8080/solr/admin/cores?action=...
STATUS
CREATE
SWAP
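The core admin actions listed above are plain HTTP requests against the `adminPath`. A minimal Python sketch that builds such URLs, using the base URL from this slide; the core name `core2` is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base URL taken from the slide; adjust host and port to your own setup.
CORE_ADMIN = "http://localhost:8080/solr/admin/cores"


def core_admin_url(action, **params):
    """Build a core admin URL for the given action and its parameters."""
    return CORE_ADMIN + "?" + urlencode({"action": action, **params})


# Report the status of all cores.
status_url = core_admin_url("STATUS")

# Register a new core (hypothetical name and instance directory).
create_url = core_admin_url("CREATE", name="core2", instanceDir="core2")

# Swap the names of two existing cores, e.g. to promote a freshly built index.
swap_url = core_admin_url("SWAP", core="core0", other="core1")

print(status_url)
print(create_url)
print(swap_url)
```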
15. Scalability - Replication
Basic architecture – indexing/querying handled by one instance
1:1 Master/slave
Indexing
Querying
1:N Master/slaves
Different user groups
16. Scalability - Sharding
Distributed index
N masters with index split between them
Simple hashing to choose index
Sharding + replication
N masters with M slaves each
More shards = faster execution time
More slaves = higher average QPS
&shards=solr1:8983/solr,solr2:8983/solr&indent=true&q=ipod+solr
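"Simple hashing to choose index" can be sketched as follows: at indexing time the client hashes each document's unique key to pick one of the N masters, and at query time it lists all shards in the `shards` parameter. This is a hedged Python sketch; the choice of CRC32 as the hash and the helper names are assumptions, not something the slide specifies.

```python
import zlib

# Shard addresses as used in the example on the slide.
SHARDS = ["solr1:8983/solr", "solr2:8983/solr"]


def shard_for(doc_id, shards=SHARDS):
    """Pick the shard a document should be indexed on, based on its unique key."""
    # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
    index = zlib.crc32(doc_id.encode("utf-8")) % len(shards)
    return shards[index]


def shards_param(shards=SHARDS):
    """Build the value of the `shards` query parameter for a distributed search."""
    return ",".join(shards)


# Hypothetical document IDs, for illustration only.
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "->", shard_for(doc_id))

print("shards=" + shards_param())
```

Because the same key always hashes to the same shard, an update to an existing document lands on the shard that already holds it, which keeps the unique key unique across all shards.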
18. SpareBank1 - Background
SpareBank1 Gruppen
19 individual localised bank portals and one parent front page
Boost 25 umbrella project
Semantic URLs: https://www2.sparebank1.no/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_
New search interface
Banking app
CMS with no easy way of tracking individual banks’ publications
Mass duplicates
Access to irrelevant data
22. Komplett - Background
Komplett NO, SE, DK… inWarehouse.se, MPX
Existing Solr solution
Mile long query with boosting per field
Poor relevance
Peripherals/accessories ranked higher than products
Limited faceting
No query completion or spellcheck
Sloooooow indexing
23. Komplett - Requirements
Superior and customisable relevance model
Much more comprehensive indexing of products and specifications
Spellcheck
Query completion
So much more faceting
Who – Swiss, Sébastien Muller, ex Solr newbie, 1 yr w/ Solr almost daily, several projects
What – work for Findwise -> Findwizard, information access consultant… Enterprise search!
Where – Oslo for a year
Why – Oslo Solr Meetup community -> semi-regular meetings at a pub in Oslo

INTRODUCTORY TALK
Internally and/or externally, both for finding information and for finding who has the information… E-commerce fails without search: can't find what you want to buy = no sale
A lot of which is unstructured
According to research performed by Google, approx. 85% of organisations can access less than 50% of the data they produce
General description – the “sales pitch”

Web based app – runs in a servlet container and is deployed as a Java WAR, works with all major application servers such as Tomcat or Jetty
No point in building when there are open source and licensed options readily available
Open source = free but might end up spending a lot to get it to do what you want – no vendor lock-in, complete customisability
Licensed = expensive and likely to spend a lot still
Although being open source allows for a very low barrier to adoption, there is no Service Level Agreement with an open source community
Open source community based project likely to yield better long term QA testing, more personally invested in the quality of the project, but life > *

Issue tracking is public
Access to source code – but the documentation isn’t necessarily comprehensive… google :D
Inverted index – like a book index = list of keywords paired with locations -> makes for very fast queries rather than searching through documents for specific terms

Document = collection of fields with optional boosting values <- book… page… database entry etc. -> represented by a single result or hit

Single/multi valued

Stored = original pre-analysis value stored and returnable by queries, necessary for some features, increases index size -> store and retrieve, unless indexed too

Indexed = searchable and facetable, unless stored will not be returned in a search

Tokenization = breaking up text sequences, filtering and transforming to generate “terms” that are tied to a specific field

Restricts the search space by creating subset of indexed documents against which queries can be made

Contributes to relevance calculations in query time, can be customised
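The inverted index and tokenization notes above can be illustrated with a toy Python sketch: text is broken into lowercase terms, and each term maps to the set of documents containing it, so a query is a lookup rather than a scan. The sample documents and the very simple tokenizer are assumptions for illustration; Lucene's real analysis chain and index format are far more elaborate.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break text into lowercase alphanumeric terms (a very simple analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, for illustration only.
docs = {
    1: "Solr is a search platform",
    2: "Lucene is a Java search library",
    3: "Tomcat is a servlet container",
}

index = build_inverted_index(docs)
print(search(index, "search library"))
```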
Schema = define data and field types

Solrconfig = search components, replication, request handlers etc.

DEMO schema/solrconfig in Notepad++ and DEMO POSTing from NetBeans/browser
Sort on relevance score, value of fields…

If there is a tie, docs are sorted by date added (indexed time)

More shiny examples to follow
Single Solr instance with separate configurations (schema and config) and indexes while maintaining the convenience of unified administration

Easy to add new cores or even replace cores with each other

STATUS, create, swap, unload, alias rename

Atomically swaps the names used to access two existing cores. This can be useful for replacing a "live" core with an "ondeck" core, and keeping the old "live" core running in case you decide to roll back.
Basic – good starting point, fine for a small index with few updates and low query load as all updates will slow down querying

Master/slave – indexing on one, querying on the other, all replications slow down indexing, can modify replication interval – improves query speed/qps

1:N – more qps, no more query speed necessarily -> both require index to remain on 1 machine
Updates go to one of N machines – unique key field must be unique across all shards – a couple of features aren’t supported, e.g. MoreLikeThis and joins

Makes it easier to rebalance index operations across more servers

shards=solr1:8983/solr,solr2:8983/solr&indent=true&q=ipod+solr -> shards parameter syntax -> can be added to a requestHandler specifically for shards
Norwegian bank and Norwegian-based e-commerce group of sites
One bank portal of about 1.5k docs, c. 50% were duplicates

Group publications made globally available via CMS but individual banks are under no obligation to publish articles and there’s no indication as to whether they had or not
Semantic FAIL