Presentation from PHP UK 2010. Despite being a key method of navigation on many sites, search functionality often gets the short end of the stick in development, either by handing the job over to Google or just enabling full text search on the appropriate column in the database. In this talk we will look at how full text search actually works, how to integrate local text search engines into your PHP application, and how it's possible to actually provide better and more relevant results than Google itself, at least for your own site.
9. Inverted Index
Term     Documents
best     1 (4, 16), 4 (422), 129 (344) ...
what     24 (50, 98), 75 (33, 208) ...
would    99 (32, 599), 201 (344) ...
...      ...
10. Boolean Query Merge
Query: Best Western Hotel
best 1 4 129 298 305 338
western 4 95 194 204 298 305
working 4 298 305
hotel 2 40 200 298 355 402
Result: Document 298
17. MySQL Full Text Search
CREATE TABLE example (
id INT(11) NOT NULL auto_increment,
title VARCHAR(255),
content TEXT,
PRIMARY KEY(id),
FULLTEXT(title,content)
) Engine=MyISAM;
INSERT INTO example (title, content) VALUES
('Mikko & Bacon','Mikko loves bacon'),
('Marcello & Bacon','Marcello hates bacon'),
('Jo & Sausages','Johanna loves sausages'),
('Hollywood & Garlic','Lorenzo hates garlic'),
('James & Cheddar','James is keen on cheeses');
18. MySQL FTI Query
SELECT * FROM example WHERE
MATCH(title,content) AGAINST('loves bacon');
+----+------------------+------------------------+
| id | title | content |
+----+------------------+------------------------+
| 1 | Mikko & Bacon | Mikko loves bacon |
| 2 | Marcello & Bacon | Marcello hates bacon |
| 3 | Jo & Sausages | Johanna loves sausages |
+----+------------------+------------------------+
3 rows in set (0.00 sec)
19. Looking At The Index
/var/lib/mysql/fttest# myisam_ftdump example 1
Total rows: 5
Total words: 17
Unique words: 14
Longest word: 9 chars (hollywood)
Median length: 5
Average global weight: 1.176117
Most common word: 2 times, weight: 0.405465 (bacon)
68. Thank you!
Ian Barber
@ianbarber
http://phpir.com
ian@ibuildings.com
http://joind.in/talk/view/1462
Editor's notes
Contact Details
This is a question we’d often like to ask our users
But with search, they tell us
Search is about getting content to users that want it
Searching users are Active and engaged
Give them what they want and they are more likely to
Buy, Read, Comment, Share etc.
This talk covers
how full text search works,
looks at some different options for integration
looks at making it better
Time for questions at the end, but one does spring to mind now:
Why search, why not let Google do it?
Private content: intranet, FB inbox, offline
Content Google is bad at: Twitter for a long time, blogs for a long time
Product focus, like Amazon
Speed of update, like a forum
Now, let's look at how a full text search operates.
Search Engine Structure
Raw Text
Documents (add url, title, split up etc.)
Text Analysis
Index
Query Parser
Query
Results
Search UI
Simplified structure of a search engine.
Start with pool of raw data, chunked into documents
Analyser processes text in docs, index stores the result
Other side: Search UI
Query parsed by query parser, much like the analyser
Searched on index, and results sorted and returned
Tokenising is taking a document and splitting it into tokens to index.
Can be difficult, even with space char.
Commas - strip punctuation naively and 1.1 mil turns into 11 mil!
Hyphens
Apostrophes
That said, starting with something simple isn’t a bad idea.
Here we look for continuous sequences of word chars
Capture with offset, which is for phrase matching.
More advanced SEs have better tokenisation: & in AT&T
Some instead have buzzwords file, specific terms: C++
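The simple tokeniser described above could be sketched roughly as follows. This is in Python rather than the talk's PHP, and the function name `tokenise`, the lower-casing, and the use of character offsets are assumptions for illustration, not the code from the slides.

```python
import re

# A "word" here is any continuous run of word characters.
WORD = re.compile(r"\w+")

def tokenise(text):
    """Return (token, offset) pairs; offset is the character position of the token."""
    return [(m.group(0).lower(), m.start()) for m in WORD.finditer(text)]

print(tokenise("Best Western, best hotel"))
# A simple tokeniser splits AT&T into two tokens, which is why better
# engines special-case such terms.
print(tokenise("AT&T"))
```

The stored offsets are what later makes phrase matching possible, since adjacent query words must also be adjacent in the document.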
Pair extracted tokens with assigned doc ID
Filter stop words - an, the, of - don’t distinguish
Position info included
Invert and merge pairs, so terms -> doc
Positions still stored, represented by ()
e.g. best at positions 4 and 16 in doc 1.
Often stored separately, or just as a straight count
List of docs == posting list
Enough to start a search
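A minimal sketch of building such an inverted index, in Python rather than the talk's PHP. The stop word list, the helper names and the dict-of-dicts layout are assumptions chosen for readability; real engines use far more compact posting lists.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "of"}  # illustrative list only

def tokenise(text):
    """Return (token, character offset) pairs."""
    return [(m.group(0).lower(), m.start()) for m in re.finditer(r"\w+", text)]

def build_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, position in tokenise(text):
            if term in STOP_WORDS:
                continue  # stop words don't help distinguish documents
            index[term].setdefault(doc_id, []).append(position)
    return index

index = build_index({1: "the best of the best", 4: "best hotel"})
print(index["best"])  # posting list for 'best', with positions per document
```

The keys of each inner dict form the posting list for the term; the position lists are the values shown in brackets on the slide.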
Take search query and tokenise the same way. Important!
For each term we array_intersect the posting lists.
Can do boolean searches by doing array union for OR etc.
BUT no RANK - any result with all words is as good as any other
Must store importance of terms to documents - weight
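The boolean merge can be sketched with set operations, which play the role of PHP's array_intersect here. This is a Python illustration using the posting lists from the Boolean Query Merge slide; the function names are hypothetical.

```python
def boolean_and(index, terms):
    """Documents containing every term: intersect the posting lists."""
    postings = [set(index.get(term, ())) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

def boolean_or(index, terms):
    """Documents containing any term: union of the posting lists."""
    return sorted(set().union(*(index.get(term, ()) for term in terms)))

index = {
    "best": [1, 4, 129, 298, 305, 338],
    "western": [4, 95, 194, 204, 298, 305],
    "hotel": [2, 40, 200, 298, 355, 402],
}
print(boolean_and(index, ["best", "western"]))           # the 'working' set
print(boolean_and(index, ["best", "western", "hotel"]))  # final result
```

Every document that survives the intersection is returned with equal standing, which is exactly the missing-rank problem noted above.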
The weighting scheme includes two measures
TF - term frequency, the count of terms in the document
IDF, inverse document frequency, the rareness of the term in the collection
Simple but usable weight algo, basis of most.
TF - Count of times term appears
IDF - total docs / docs with term, e.g. 10 total / 3 with term. Take the log to smooth
Store this score with the document in the posting list for the term
Normalise scores over a doc to account for length - but this still boosts short text
TF-IDF PHP code
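The PHP from the slide is not part of this transcript. As a stand-in, here is a rough Python sketch of the weighting described in these notes: raw term count times log(total docs / docs with term), normalised so each document vector has length 1. The function name and data layout are assumptions, and real engines usually use tuned variants of this formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: {doc_id: [tokens]}. Returns {doc_id: {term: weight}}, unit length per doc."""
    total = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    weights = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        vector = {t: tf * math.log(total / doc_freq[t]) for t, tf in counts.items()}
        length = math.sqrt(sum(w * w for w in vector.values()))
        weights[doc_id] = {t: (w / length if length else 0.0) for t, w in vector.items()}
    return weights

docs = {
    1: ["mikko", "loves", "bacon"],
    2: ["marcello", "hates", "bacon"],
    3: ["johanna", "loves", "sausages"],
}
for doc_id, vector in tf_idf(docs).items():
    print(doc_id, vector)
```

Rare terms such as a name end up with a higher weight than terms shared by several documents, and a term that appears in every document gets weight 0.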
Document is position in N dimension space
One dimension for each term ever seen
Mostly 0
Normalised to length 1 (sqrt of sum of sqrs of vals)
Just look at 2 terms here to keep it simple
Here, rather than just looking for matches
We accumulate a score for each matching document
246 is our highest scoring document, picking up two good scores
But 120 makes it in at number three, despite not having ‘best’ in it.
For a 2 term, 2 dimension case, that looks like this.
Calculate the cosine of the angle between them with the dot product
Similarity - 1 = same, 0 = orthogonal (no shared terms)
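As a small Python sketch of that similarity measure (the function name and the sparse dict representation of a vector are assumptions for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine({"best": 1.0, "hotel": 1.0}, {"best": 2.0, "hotel": 2.0}))  # same direction
print(cosine({"best": 1.0}, {"hotel": 1.0}))                             # no shared terms
```

If the vectors are already normalised to length 1, the division is unnecessary and the dot product alone is the similarity.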
We can treat a query as a new document
The documents it is most similar to are the best results
Only need to compare to documents that share terms - the rest will score 0
Look at query terms, retrieve posting list from index
Treat query term weights as 1 - incorrect, but ok for relative results
Index merge, and calc dot product by summing weights.
So, don’t need a full match
Could add phrase search, or positional bias.
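The query process above can be sketched as a score accumulator, shown here in Python rather than PHP. The weights are made-up numbers chosen to mirror the example in these notes (document 246 on top, document 120 third without containing 'best'); the function name and index layout are assumptions.

```python
from collections import defaultdict

def search(index, query_terms, limit=10):
    """index: {term: {doc_id: weight}}. Query term weights are treated as 1."""
    scores = defaultdict(float)
    for term in query_terms:
        # Only documents in the posting list can score above 0 for this term.
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight  # dot product, one term at a time
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]

index = {
    "best": {246: 0.5, 18: 0.7},
    "hotel": {246: 0.6, 120: 0.65},
}
for doc_id, score in search(index, ["best", "hotel"]):
    print(doc_id, round(score, 2))
```

Because scores are summed rather than intersected, a document matching only one strong term can still outrank weak full matches.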
Two main questions
Where does the data come from?
How is the index accessed?
Look at 6 PHP friendly engines
Each different integration method
Each with new bits of functionality
Data from database columns in one table
Simplest of all to implement - integrate through query
Note fulltext index.
Straight vector space search impl. as described before.
Can only be used with MyISAM, not InnoDB (true at the time of the talk; InnoDB gained full text indexes in MySQL 5.6)
If you’re using PostgreSQL, tsearch has been built in since 8.3
MATCH AGAINST syntax
Boolean too - all engines have this, we focus on natural
Only one document has both words
Ranked in score order - MATCH AGAINST returns a float
Note there’s some tricky default config: minimum word length of 4, and a 50% threshold (words present in half or more of the rows are ignored)
One interesting option is Query Expansion -
Blindly expand the search based on words returned.
Usually not a very good idea, because we want more precision than recall
Precision is quality of results, recall is completeness
In this case it’s expanded to Lorenzo, because of Marcello’s hatred for bacon
Can actually interrogate the index
myisam_ftdump
Run from the database directory
However, let’s say you want to search on a normalised schema directly - multiple tables
Using Sphinx you can index a more complex query
Used on Craigslist, and apparently on The Pirate Bay
There is a PHP API for access, or the extension pecl/sphinx
Same interface but faster
Once installed, setup sphinx.conf file
Top: Connection Stuff - also works with postgresql
Indexing on sql_query - could use view, complex etc.
Adding attributes - non-indexed elements of a doc - numeric or timestamp only in Sphinx.
Multi-valued attributes support many-to-many data such as tags
Other options, such as sql_query_pre or sql_query_post
Next tell Sphinx about the index
Minimum length of indexed word
Prefix for wildcard search - infix matches anywhere, prefix allows a wildcard at the end
We also enable a stemmer
Stemming consistently collapses different forms of the same word to a stem
Here each version is reduced to ‘happen’; the stem is not always an English word, just a consistent one
This allows us to match more words, and is often, but not always, helpful
The most common algorithm is the Porter stemmer; there is a PHP implementation on the site
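To make the idea concrete, here is a deliberately naive suffix stripper in Python. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list and minimum stem length are arbitrary assumptions.

```python
SUFFIXES = ("ing", "ed", "s")  # checked in this order; illustrative only
MIN_STEM = 3                   # never strip a word down to fewer characters

def naive_stem(word):
    """Strip one common English suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[:-len(suffix)]
    return word

for word in ("happens", "happened", "happening", "loves", "love"):
    print(word, "->", naive_stem(word))
```

The same function must be applied at index time and at query time, otherwise 'loves' in a query will not find documents indexed under 'love'.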
Indexer command to build index
Might lock DB table, there is a ranged table workaround
Command line search, defaults to require all
Stemmer - love vs loves
Last line - start indexing daemon
Match any word
Wildcard search - prefix search,
Returns both ‘bacon’ docs
Add filters - limiting to certain values of attributes
Now we just get 1 result
Sphinx can be built into mysql as a table type, and queried via a where arg
From the other end - Swish is easy to plug in to existing system at short notice
Swish-e is an engine with a long pedigree, and a PHP extension.
Used by quite a few universities.
Doesn’t support multibyte charsets, which is a bit of a downside.
Great for combinations where you have a bunch of word docs or similar documents, and a website, and you want to search both.
First ‘fs’ for file system index - we create a conf file for indexer
In the conf we tell it where to look for files
FileFilters extract text from non-text formats such as doc/pdf
Can specify IndexDir multiple times for different doc stores
Requires wvWare and xpdf
Apache Tika
Includes an effective web crawler, another way to get data
Getting it through the web loses some of the advantages
Can plug into a website you have no real control over
Mode is prog to call out to the spider script
Index file is different name
Being able to query across the two indexes is very handy
Here we search fs and www indexes and give combined results
Can use various filters to limit search to parts of HTML documents
Or filter on file system paths
Now we’ll look at engines where we index from within PHP
Lucene, the Apache Foundation search engine
Very successful, but has ports instead of bindings
Native PHP port in Zend Framework, Zend Search Lucene
Hook right into the application, easy addition/plugin
Lots of control, easy to add metadata/attributes
Lucene calls them fields: string keys, multiple value types
Text indexed and original stored - unstored not
Index compatible w/ Java lucene 2.3 - can index java, search PHP
Querying is straightforward, and quick.
The scores are only really interesting as a relative value
Includes some handy utilities such as HTML doc parser
Spits out various fields such as title and body auto
Allows you to add other fields as required
Advantage of PHP - easy to hack at, add new doc types
HOWEVER - doesn’t scale to large collections, so you may prefer to use one of the Java based versions... and the easiest way is with Solr
Solr uses Java Lucene - wraps it in a REST+XML/JSON web service
Convenient for all the usual SOA reasons.
Solr is in use by CNET, Digg, Netflix and other high profile sites.
There is a PHP extension, or a PHP client API
Not massively different from ZSL.
Solr needs you to create a schema first, to define the fields of docs
Note the client commit down the bottom.
Until a document is committed, it is not visible to searches
Hardly know you’re using a web service
Searching is similar
XML based response format means a more complex return struct
Solr is great for larger scale collections
Provides good admin functionality - enterprise friendly
Our last engine is Xapian
High performance C based search engine.
There is a Solr like service called Flax, but we’ll look at the engine directly.
PHP SWIG based extension and low level API
Gives some cool features, and a lot of control
Creates database on FS, or can be accessed remotely
Separation between the document and the indexer
Integration of stemmer - English here
We have a numeric indexed attribute, referred to as a ‘value’ here, for the title
Xapian index (local etc.)
The searching is more complicated
We have more control -
STEM_SOME, don’t stem words that start with a capital letter (proper nouns)
Xapian query
Retrieving the result relies on these functions wrapping around C iterators.
Note the percentage score value - overtly relative, but can be thresholded if needed
Xapian query result
We have a search engine
We know where the data is coming from
How can we improve the results?
Link text can be a great source of keywords
To use a classic example, from one of the early papers about Google, if someone types ‘big blue’ into Google, one of the top results is IBM.com.
But the page it points to doesn’t contain the phrase
Pages that link to it do contain that phrase, and Google indexes it against that text.
Big win for things like images and videos, where there may be no text
Need to parse document
Easy in PHP with the DOM parser
We could then add these to the index, as a new field on a document
ZSL has a built in html document type, but the getLinks function doesn’t include anchor text
Anchor text extract
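The extraction code from the slide is not in this transcript. A comparable sketch in Python, using the standard library's `html.parser` in place of PHP's DOM parser, might look like this; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collects (href, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            self.links.append((self._href, text))
            self._href = None

def extract_anchor_text(html):
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links

print(extract_anchor_text('<p>See <a href="http://www.ibm.com/">Big <b>Blue</b></a> today.</p>'))
```

Each pair can then be indexed as an extra field on the document that the link points to, rather than on the page containing the link.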
The next idea is zone weighting.
This is a page from my blog
I know what’s important on this page - 1 to 3
Google has to guess, based on appearance
Green = boilerplate - don’t index
Index these zones as fields, and weight differently
If we break our content down into fields,
We can set different ‘boost’ values on them
Boosts > 1 more important, < 1 less important
E.G. de-emphasise comments
Document Weight - Importance, Authority
In general - not tied to specific query
PageRank - but that won’t work on a small collection
Comments - “great post”, comment count
Inbound visitors
Retweets - Google uses a UserRank PR type calculation on follower counts
Similar to zones, boost at document level
The default is 1
Adding one 100th for each comment
This of course could be tuned for individual circumstances
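Both kinds of boost can be sketched together in Python. The field names, boost values and function names below are assumptions for illustration; engines such as Zend Search Lucene apply boosts inside their own scoring formula rather than as a simple multiplication like this.

```python
# Zone boosts: > 1 makes a field more important, < 1 less important.
FIELD_BOOSTS = {"title": 2.0, "body": 1.0, "comments": 0.5}

def document_boost(comment_count):
    """Default boost of 1, plus one hundredth for each comment."""
    return 1.0 + comment_count / 100

def boosted_score(field_scores, comment_count, boosts=FIELD_BOOSTS):
    """field_scores: {field: raw match score for the query in that field}."""
    zone_score = sum(score * boosts.get(field, 1.0)
                     for field, score in field_scores.items())
    return zone_score * document_boost(comment_count)

# A title match counts double, a match in the comments only half.
print(boosted_score({"title": 0.5, "comments": 1.0}, comment_count=0))
print(boosted_score({"title": 0.5, "comments": 1.0}, comment_count=100))
```

The document boost is independent of the query, so it can be computed once at index time and stored with the document.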
Got engine, got data, got good results
Now, look at ways to improve search user experience
With UI - do what other websites do
With search - do what google et al do
Summaries or snippets show a selection of the page
Sphinx build highlights
Most search engines have some support for this.
With Sphinx here, we can pass the query and index name to the BuildExcerpts function to get highlighted contextual snippets
getTextFromDB is just a pretend function that would wrap retrieving the raw full text.
We can do this by storing some of the original text in the search engine
We’ve added a StoreDescription based on the body, for 1000 characters
This will appear in the result object as swishdescription.
We may want to index more, then choose the bit we display based on the presence of query words.
Google highlights search terms in summaries
Can do on whole document as well
Easy to do in many engines
ZSL highlight matches - could use stored field or external
HighlightMatches without fragment will add HTML headers
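Where an engine has no built-in support, choosing a snippet by the presence of query words and highlighting the matches can be approximated by hand. This Python sketch is an assumption-laden illustration (fixed word window, first best window wins, no HTML escaping), not how Sphinx or Zend Search Lucene implement it.

```python
import re

def best_snippet(text, query_terms, width=30):
    """Return the window of `width` words containing the most query terms."""
    words = text.split()
    terms = {term.lower() for term in query_terms}

    def hits(start):
        window = words[start:start + width]
        return sum(1 for w in window if re.sub(r"\W", "", w).lower() in terms)

    last_start = max(len(words) - width, 0)
    best = max(range(last_start + 1), key=hits)  # earliest window wins ties
    return " ".join(words[best:best + width])

def highlight(text, query_terms, tag="b"):
    """Wrap whole-word, case-insensitive matches of the query terms in a tag."""
    if not query_terms:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in query_terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: "<%s>%s</%s>" % (tag, m.group(0), tag), text)

text = "Mikko went to the market and there Mikko loves bacon very much"
snippet = best_snippet(text, ["loves", "bacon"], width=4)
print(highlight(snippet, ["loves", "bacon"]))
```

In a real page the snippet text should be HTML-escaped before the highlight tags are added.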
Spelling correction is a really handy function
Important to correct to known words from the index
Rather than default dictionary
Xapian example - set flag on indexer & queryparser.
We had an index based on PHP documentation
Have mistyped str_replace and strcmp
Function names were corrected, despite not being ‘words’
They featured in index, and had low edit distance from query
Some low quality results returned - where we might use threshold
Solr/Lucene has a similar plugin
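The principle of correcting against terms from the index can be sketched with a plain Levenshtein edit distance. This Python example is not Xapian's implementation; the term list, the misspellings and the distance threshold are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # delete from a
                current[j - 1] + 1,                   # insert into a
                previous[j - 1] + (char_a != char_b)  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def suggest(word, index_terms, max_distance=2):
    """Closest term from the index, or None if nothing is close enough."""
    best = min(index_terms, key=lambda term: (edit_distance(word, term), term),
               default=None)
    if best is None or edit_distance(word, best) > max_distance:
        return None
    return best

terms = ["str_replace", "strcmp", "strlen", "substr"]  # terms seen in the index
print(suggest("str_replcae", terms))
print(suggest("strcpm", terms))
```

Scanning every index term like this is slow on a large index; engines typically narrow the candidates first, for example with character n-grams.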
Another useful idea is sorting result sets on something other than rank
This is an example from Google News
E.g. file search, email, private messages may want other orderings (sender, date, subject)
Here we’ve added a sort on title
Can be expensive as SEs can’t use their normal shortcuts
But normally straightforward
We’ve got a search here on Epicurious, the food and cooking site.
Shows categories and result counts
This is called faceted search, categories = facets
Document has many categories
Good for product based search
Solr was built with faceted search in mind for CNET reviews
Enable faceted mode, set one facet, ‘cat’
If we’d been duplicating Epicurious, each of the options on the left would have been a facet.
Get results plus enumeration of options in each facet + count
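What the engine hands back for a facet is essentially a count per value across the matching documents. A toy Python version over an in-memory result list (the documents and the helper name are invented; the 'cat' field name follows the notes) could be:

```python
from collections import Counter

def facet_counts(results, facet):
    """Count how many matching documents carry each value of the facet field."""
    counts = Counter()
    for doc in results:
        # A document can sit in many categories, so the field holds a list.
        for value in doc.get(facet, []):
            counts[value] += 1
    return counts

results = [
    {"title": "Chocolate cake", "cat": ["dessert", "baking"]},
    {"title": "Fruit salad", "cat": ["dessert", "vegan"]},
    {"title": "Sourdough", "cat": ["baking"]},
]
for value, count in facet_counts(results, "cat").most_common():
    print(value, count)
```

Solr computes these counts over the whole result set, not just the page of results returned, which is what makes the feature worth having in the engine.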
User can offer feedback by selecting more like this
Find documents like this one
Good for search with many meanings - ‘creed’ (game, band, belief)
Example from a dissertation search engine
Generate search from document user selected
Xapian has this built in, can do it in Solr as well.
Top 40 most important terms extracted (can do more than one doc)
Using str_replace from index of phpdoc
Combine terms with ORs
Finds itself, and other good matches
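The generate-a-query step can be sketched as picking the highest weighted terms of the selected document and joining them with OR. This Python example is only an approximation of the idea: Xapian chooses its expansion terms with its own relevance weighting, and the weights, names and query syntax below are assumptions.

```python
def more_like_this_query(doc_weights, top_n=40):
    """Build an OR query from the top_n most important terms of a document."""
    # Sort by weight descending, then alphabetically to keep ties stable.
    top_terms = sorted(doc_weights, key=lambda t: (-doc_weights[t], t))[:top_n]
    return " OR ".join(top_terms)

# Hypothetical term weights for the str_replace manual page.
weights = {"str_replace": 0.9, "replace": 0.7, "string": 0.5, "the": 0.01}
print(more_like_this_query(weights, top_n=3))
```

Because the query is built from the document's own strongest terms, the document itself will normally come back as the top result.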
MySQL FTI has blind query expansion, which gets more results based on the results retrieved - not as good, and hella slow!
Search can be expensive
Lots of data to process
Most engines have some sort of query cache built in
We’ll take a quick look at some different aspects of performance.
Indexes designed for more read than write
Adding data to a large index can be expensive.
Have two indexes
Merge
Lucene uses segments automatically
Smaller index: less IO, better O/S cache, faster results
But slower update speed
Recombine segments, Merge deltas
Optimise and compress index
This can be an expensive operation though.
Try to keep index on local disk, not network
When demands too big for a single server, need to look at distributing
Replication tends not to give such a boost here: the usual problem is an index so large that single queries are slow, rather than too many queries
Need to shard contents based on hash - something not searched for
Most systems have a way of working with remote backends, to give single search and sort point
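A rough Python sketch of the two halves of sharding: assigning documents to shards by a hash of something not searched for (here the document ID), and merging the per-shard results into one ranked list. The function names, the use of MD5 and the result format are assumptions, not the mechanism of any particular engine.

```python
import hashlib
import heapq
import itertools

def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document ID."""
    digest = hashlib.md5(str(doc_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def merge_results(shard_results, limit=10):
    """Merge per-shard lists of (score, doc_id), each sorted best first."""
    merged = heapq.merge(*shard_results, key=lambda result: result[0], reverse=True)
    return list(itertools.islice(merged, limit))

print(shard_for(298, 4))
print(merge_results([[(0.9, "a"), (0.2, "b")], [(0.8, "c"), (0.1, "d")]], limit=3))
```

Hashing spreads documents evenly, so every query has to go to every shard; the merge step is the single search and sort point mentioned above.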
The systems we’ve talked about will all index tens of thousands of documents
Xapian and Solr should handle into the millions on one server
100s of millions/billions = webscale - Challenges: data size, rate of update
Nutch is a FOSS webscale SE/crawler created by Doug Cutting, of Lucene.
He also did Hadoop: MapReduce, distributes files etc. (not being sued by Google)
Used on thousands of nodes at Yahoo, among others