Ruby Day Kraków: Full Text Search with Ferret
1. Ruby Day Kraków: Full Text Search with Ferret
Agnieszka Figiel
25th November 2006
2. Agenda
full text search implementation options
tools for ruby
ferret and acts_as_ferret
searching with ferret
overview of index options
multi search
more like this
3. Full Text Search
A search of a document collection, which examines all of the words
in every stored document as it tries to match search words supplied
by the user.
index
  tokenize all documents
  filter out stop words
  apply stemming
  apply a term weighting scheme
search
  use the index to find all documents matching a query
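The index and search steps above can be sketched in plain Ruby. This is a toy inverted index, not Ferret's implementation: the stop-word list, the one-rule stemmer and the use of raw term frequency as the weight are all assumptions made for illustration.

```ruby
# Toy inverted index illustrating the index/search steps.
# Not Ferret's implementation; every helper here is hypothetical.
STOP_WORDS = %w[a an and in is of the to].freeze

# Very crude stemmer (assumption): strips one trailing "s" from longer words.
def stem(word)
  word.length > 3 ? word.sub(/s\z/, '') : word
end

# Tokenize, filter out stop words, apply stemming.
def analyze(text)
  text.downcase.scan(/[a-z0-9]+/)
      .reject { |w| STOP_WORDS.include?(w) }
      .map { |w| stem(w) }
end

# Build term => { doc_id => term frequency }.
# Term frequency stands in for a real term weighting scheme.
def build_index(docs)
  index = Hash.new { |h, k| h[k] = Hash.new(0) }
  docs.each do |id, text|
    analyze(text).each { |term| index[term][id] += 1 }
  end
  index
end

# Ids of documents containing every query term, highest summed frequency first.
def search(index, query)
  terms = analyze(query)
  return [] if terms.empty?

  ids = terms.map { |t| index.key?(t) ? index[t].keys : [] }.reduce(:&)
  ids.sort_by { |id| [-terms.sum { |t| index[t][id] }, id] }
end

DOCS = {
  1 => 'Ferret is a search library for Ruby',
  2 => 'Searching blogs with Ruby on Rails',
  3 => 'Two ferrets in the garden'
}.freeze

INDEX = build_index(DOCS)

puts search(INDEX, 'ruby').inspect
puts search(INDEX, 'Ferrets').inspect
puts search(INDEX, 'ferrets ruby').inspect
```

Because both documents and queries pass through the same analyze step, a query for "Ferrets" finds documents that only contain "Ferret".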
4. Database Full Text Index
MySQL
PostgreSQL
MS SQL
Oracle
DB2
5. Search Systems
Google, Yahoo
Swish-e (C, Perl API available)
Lucene (Java, ports for C, C++, .NET, Delphi, Perl, Python, PHP, Common Lisp, Ruby)
Nutch (Lucene + crawler)
Lucene-WS (Lucene via REST)
SOLR (Lucene via XML/HTTP and JSON)
6. Ruby Search Systems
Hyper Estraier
Ferret
7. Ferret
http://rubyforge.org/projects/ferret
A text search engine library written for Ruby, inspired by the
Apache Lucene Java project.
8. acts_as_ferret
http://projects.jkraemer.net/acts_as_ferret/wiki
a plugin for Ruby on Rails which builds on Ferret
search across the contents of any Rails model class
each model has its own index on disk
search multiple models
support for Rails Single Table Inheritance
index attributes or virtual attributes of a model
indexing can be customized by overriding the to_doc method
find similar items (’more like this’)
9. Installation
ferret gem:
  gem install ferret
acts_as_ferret:
  script/plugin install svn://projects.jkraemer.net/acts_as_ferret/tags/stable/acts_as_ferret
10. Example
YASB (Yet Another Searchable Blog)
class Post < ActiveRecord::Base
  has_many :comments
end
class Comment < ActiveRecord::Base
  belongs_to :post
end
11. Basic post search
Let’s add a basic search on the Post model:
class Post < ActiveRecord::Base
  has_many :comments
  acts_as_ferret
end
Search posts:
Post.find_by_contents(search_term)
The index for the Post model is created when the first search is
run.
If no additional options are given, ALL fields are indexed, including
arrays of child objects (STI).
12. Limit indexed fields
To limit the fields that are indexed for a given model, list them
explicitly:
acts_as_ferret :fields => [ 'title', 'body' ]
NOTE: after any change to index settings, the index needs to be
rebuilt.
Post.rebuild_index
13. Index options
Ferret offers numerous options for customising indexing.
Example:
acts_as_ferret(:fields => {
    :title => { :boost => 2 },
    :body  => { :boost => 1 }
  }, :store_class_name => true)
This will add a boost (importance) factor of 2 to the title field,
and 1 to the body field. The class name will be stored for multiple
class searches.
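To see why a boost changes ranking, here is a toy scorer in plain Ruby. It assumes a score of boost times the number of occurrences, summed over fields. That formula is a deliberate simplification for illustration and is not Ferret's scoring; the score helper and the sample posts are hypothetical.

```ruby
# Toy scoring: score = sum over fields of (boost * occurrences of the term).
# A simplification for illustration only, not Ferret's scoring formula.
BOOSTS = { title: 2.0, body: 1.0 }.freeze

def score(doc, term, boosts = BOOSTS)
  boosts.sum do |field, boost|
    words = doc.fetch(field, '').downcase.scan(/\w+/)
    boost * words.count(term.downcase)
  end
end

POSTS = [
  { id: 1, title: 'Weekend notes', body: 'I tried ferret today' },
  { id: 2, title: 'Ferret tips', body: 'Some indexing advice' }
].freeze

# Highest score first: a title match outranks a body match.
RANKED = POSTS.sort_by { |post| -score(post, 'ferret') }

RANKED.each { |post| puts "#{post[:id]}: #{score(post, 'ferret')}" }
```

Both posts mention the term once, but the post that has it in the title ranks first because the title boost is twice the body boost.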
14. Index options: store
Value         Description
:no           Don’t store the field.
:yes          Store the field in its original format. Use this value if you
              want to highlight matches or print match excerpts à la Google
              search.
:compressed   Store the field in compressed format.
15. Index options: index
Value                     Description
:no                       Do not make this field searchable.
:yes                      Make this field searchable and tokenize its contents.
:untokenized              Make this field searchable but do not tokenize its
                          contents. Use this value for fields you wish to sort by.
:omit_norms               Same as :yes except omit the norms file. The norms file
                          can be omitted if you don’t boost any fields and you
                          don’t need scoring based on field length.
:untokenized_omit_norms   Same as :untokenized except omit the norms file.
16. Index options: term vector
Value                     Description
:no                       Don’t store term-vectors.
:yes                      Store term-vectors without storing positions or offsets.
:with_positions           Store term-vectors with positions.
:with_offsets             Store term-vectors with offsets.
:with_positions_offsets   Store term-vectors with positions and offsets.
17. Index options: boost
Value   Description
Float   The boost property is used to set the default boost for a field.
        This boost value will be used for all instances of the field in the
        index unless otherwise specified when you create the field. All
        values should be positive.
18. Search the comments
Searching a model and its related models can be achieved with
virtual attributes.
A getter defined in the Post class that joins all comment messages:
def post_comments
  comments.collect { |c| c.message }.join(' ')
end
Add it to ferret’s field list like a normal field:
acts_as_ferret :fields => [ 'title', 'body', 'post_comments' ]
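The virtual attribute can be tried without Rails. The sketch below uses plain-Ruby stand-ins for the two models (an assumption, since no database is involved), and indexable_fields is a hypothetical helper showing that a field list is just a list of method names whose return values would be indexed.

```ruby
# Plain-Ruby stand-ins for the ActiveRecord models (no Rails, no database).
Comment = Struct.new(:message)

class Post
  attr_reader :title, :body, :comments

  def initialize(title, body, comments = [])
    @title = title
    @body = body
    @comments = comments
  end

  # Virtual attribute: all comment messages joined into one string.
  def post_comments
    comments.collect { |c| c.message }.join(' ')
  end

  # Hypothetical helper: the values an indexer would read for each field name.
  def indexable_fields(names = %w[title body post_comments])
    names.map { |name| [name, public_send(name)] }.to_h
  end
end

POST = Post.new('Hello', 'First post',
                [Comment.new('Nice'), Comment.new('Great ferret tips')])

puts POST.post_comments
puts POST.indexable_fields.inspect
```

Since the indexer only calls the named method, a search for a word that occurs in a comment can match the parent post.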
19. Search in multiple models
To search both comments and posts at once (multi search) we need to:
create an index for both models
set the store_class_name flag on each of them
After rebuilding the indices for Post and Comment we can run a multi
search on both:
Post.multi_search(params[:search], [Comment])
20. More like this
We would like to find the posts most similar to a chosen one.
That’s pretty simple:
post.more_like_this({:field_names => ['title', 'body', 'post_comments'],
                     :min_term_freq => 2, :min_doc_freq => 3})
The options passed here tell the search engine two things:
consider only terms that appear at least twice in the source document
consider only terms that appear in at least 3 documents
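The effect of the two thresholds can be sketched in plain Ruby. The code below only shows how candidate terms might be filtered; it assumes simple word tokenization with no stemming or stop words, and interesting_terms is a hypothetical helper, not part of acts_as_ferret.

```ruby
# Sketch of term selection for a "more like this" query.
# Assumptions: plain word tokenization, no stemming, no stop words.
def tokens(text)
  text.downcase.scan(/\w+/)
end

# Terms of `source` that occur at least min_term_freq times in it and
# appear in at least min_doc_freq documents of the collection.
def interesting_terms(source, collection, min_term_freq: 2, min_doc_freq: 3)
  term_freq = tokens(source).each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }
  doc_freq = Hash.new(0)
  collection.each { |doc| tokens(doc).uniq.each { |t| doc_freq[t] += 1 } }
  term_freq.select { |t, n| n >= min_term_freq && doc_freq[t] >= min_doc_freq }
           .keys.sort
end

TEXTS = [
  'ferret ferret search tips',
  'ruby search with ferret',
  'ferret index options',
  'rails rails deployment'
].freeze

puts interesting_terms(TEXTS[0], TEXTS).inspect
puts interesting_terms(TEXTS[0], TEXTS, min_term_freq: 1, min_doc_freq: 2).inspect
```

Raising either threshold narrows the set of terms used to build the similarity query, which trades recall for fewer incidental matches.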
21. Links
Products:
Swish-e http://swish-e.org/index.html
Lucene http://lucene.apache.org/java/docs/index.html
Nutch http://lucene.apache.org/nutch/
Lucene-WS http://lucene-ws.sourceforge.net/
SOLR http://incubator.apache.org/solr/
Hyper Estraier http://hyperestraier.sourceforge.net/
Ferret http://rubyforge.org/projects/ferret
acts_as_ferret http://projects.jkraemer.net/acts_as_ferret/
Reading:
tutorial by Roman Mackovcak: http://blog.zmok.net/articles/2006/10/18/full-text-search-in-ruby-on-rails-3-ferret
tutorial by Seth Fitzsimmons: http://mojodna.net/searchable/ruby/railsconf.pdf
aaf and Unicode by Albert Ramstedt: http://albert.delamednoll.se/articles/2005/12/20/the-ferret-plugin-with-simple-unicode-support
22. Thank you!
Good luck using ferret!