This is a presentation for my class in graduate school. I'm going to introduce a command-line full text search engine written from scratch in Python.
Falcon Full Text Search Engine
1. Falcon
Full Text Search Engine
Mar 28, 2015
Adamson University
Master in Information Technology
Advanced Object Oriented Programming
Hideshi Ogoshi
2. What is Falcon
The name represents speed and strength
Lightweight full text search engine
Command line application
Provides an HTTP server mode
Written in Python programming language
Only 1 file and 421 lines of code
Data is stored in SQLite3 database
https://github.com/hideshi/Falcon
3. What is a full text search engine
A storage for text documents
Much faster than an SQL query that uses a LIKE ‘%%’ partial match expression
Composed of an index manager, an index builder and a search function
Has its own data structure called an ‘inverted index’
Text is split into tokens by a ‘tokenizer’
4. What is a tokenizer
Splits text into tokens at the spaces that separate words
A token is a group of characters
This is a book -> ‘This’, ‘is’, ‘a’, ‘book’
Useful for the many languages that separate words by spaces, such as English, French and Tagalog
Applying it to languages such as Japanese or Chinese causes problems, because these languages don’t use spaces in their sentences
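A space-based tokenizer can be sketched in a few lines of Python. This is only an illustration, not Falcon's own code, and the function name `tokenize` is hypothetical. It shows why the approach works for English but returns a Japanese sentence as a single unsplit token.

```python
def tokenize(text):
    """Split text into tokens at runs of whitespace."""
    return text.split()


print(tokenize("This is a book"))  # ['This', 'is', 'a', 'book']
print(tokenize("これは本です"))      # ['これは本です'] (no spaces, so nothing is split)
```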
5. What is an ngram tokenizer
A kind of tokenizer that splits words or sentences into several tokens
Each token has a fixed number of characters
The number of characters depends on the type of ngram tokenizer: unigram (1), bigram (2), trigram (3), etc.
6. What is a bigram
How a bigram tokenizer splits a sentence into tokens
Each token has two characters
English
This is a book -> ‘Th’, ‘hi’, ‘is’, ‘si’, ‘is’, ‘sa’, ‘ab’, ‘bo’, ‘oo’, ‘ok’ (spaces removed)
Japanese
これは本です -> ‘これ’, ‘れは’, ‘は本’, ‘本で’, ‘です’
Chinese
这是书 -> ‘这是’, ‘是书’
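The sliding-window idea behind these examples can be sketched as follows. This is a minimal illustration rather than Falcon's implementation: the name `ngram_tokenize` is hypothetical, and dropping whitespace before windowing is an assumption inferred from the English example. The sketch emits every overlapping pair, including repeated ones, so its output for a sentence can be longer than a list of distinct tokens.

```python
def ngram_tokenize(text, n=2):
    """Return the overlapping character n-grams of text, ignoring whitespace."""
    chars = "".join(text.split())  # drop spaces so n-grams can span word boundaries
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


print(ngram_tokenize("This is a book"))
# ['Th', 'hi', 'is', 'si', 'is', 'sa', 'ab', 'bo', 'oo', 'ok']
print(ngram_tokenize("これは本です"))
# ['これ', 'れは', 'は本', '本で', 'です']
print(ngram_tokenize("这是书"))
# ['这是', '是书']
```

Passing `n=1` or `n=3` gives the unigram and trigram variants of the same tokenizer.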
7. What is an inverted index
A data structure that provides a faster way to retrieve data: a dictionary of tokens, each with a posting list of the positions where it occurs
Example: This is a book. That is a pen.
Dictionary   Posting List
This         0
is           1, 5
a            2, 6
book         3
That         4
pen          7
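The dictionary and posting lists for this sentence can be reproduced with a short sketch. It assumes word-level tokens and 0-based word positions, as in the slide. The function name `build_posting_lists` is hypothetical and is not taken from Falcon.

```python
import re
from collections import defaultdict


def build_posting_lists(text):
    """Map each word to the list of 0-based positions where it occurs."""
    index = defaultdict(list)
    # \w+ keeps the words and skips punctuation such as the full stops.
    for position, word in enumerate(re.findall(r"\w+", text)):
        index[word].append(position)
    return dict(index)


index = build_posting_lists("This is a book. That is a pen.")
for word, positions in index.items():
    print(word, positions)
```

Looking up a word is then a single dictionary access, for example `index["is"]`, instead of a scan over the whole text.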
8. What is an inverted index
“government of the people, by the people, for the people,
shall not perish from the earth.”
{“by”, 1, {1: [4]}}, {“earth”, 1, {1: [15]}}, {“for”, 1, {1: [7]}},
{“from”, 1, {1: [13]}}, {“government”, 1, {1: [0]}},
{“not”, 1, {1: [11]}}, {“of”, 1, {1: [1]}},
{“people”, 3, {1: [3, 6, 9]}}, {“perish”, 1, {1: [12]}},
{“shall”, 1, {1: [10]}}, {“the”, 4, {1: [2, 5, 8, 14]}}
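Each entry above appears to hold a token, its number of occurrences, and a mapping from document ID to the positions in that document. A minimal sketch that produces entries of this shape is shown below. The names `add_document` and `entries` are hypothetical, and treating the count as the total number of occurrences across documents is an assumption, since the slide shows only one document.

```python
import re
from collections import defaultdict


def add_document(index, doc_id, text):
    """Record the position of every word of `text` under `doc_id`."""
    for position, word in enumerate(re.findall(r"\w+", text)):
        index[word][doc_id].append(position)


def entries(index):
    """Yield (token, occurrence count, {doc_id: positions}) sorted by token."""
    for token in sorted(index):
        postings = {doc_id: list(p) for doc_id, p in index[token].items()}
        yield token, sum(len(p) for p in postings.values()), postings


index = defaultdict(lambda: defaultdict(list))
add_document(index, 1, "government of the people, by the people, for the people, "
                       "shall not perish from the earth.")
for entry in entries(index):
    print(entry)  # first line printed: ('by', 1, {1: [4]})
```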
11. Performance Tuning
Tokens that contain stop characters, which are symbols such as !”#$%&’()-=^~¥|@`[{;+:*]},<.>/?_, are ignored by the tokenizer to reduce the time for creating the index and searching.
Document contents are compressed with the bzip2 algorithm to reduce the time for queries. The compression rate is 38.6% at best and 79.3% on average.
Turn off SQLite's journal_mode and synchronous settings so that no unnecessary files are created when records are inserted. This increases speed by 8%.
Use a bulk insert instead of executing an insert statement for each record. This increases speed by 11%.
Falcon provides an in-memory database mode powered by SQLite3. While creating the index, Falcon creates new records in memory to reduce the time spent on I/O. After the index is created, the in-memory database is stored in a file. This increases speed by 17%.
Constantly check the memory usage of the inverted index objects. When it exceeds the limit, the data is stored in the database and deleted from memory. This increases speed by 380%.
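Several of these tuning steps map directly onto Python's standard `bz2` and `sqlite3` modules. The sketch below is not Falcon's code. It only illustrates bzip2 compression, the two PRAGMA settings, a bulk insert with `executemany`, and building in memory before writing to a file. The table layout and the names `build_database` and `load_content` are assumptions, and `Connection.backup` (Python 3.7+) is just one possible way to persist an in-memory database. The memory-limit flush of the inverted index is not covered.

```python
import bz2
import sqlite3


def build_database(documents, path=None):
    """Build a document table in memory; copy it to `path` at the end if given."""
    conn = sqlite3.connect(":memory:")
    # Trade durability for speed: no rollback journal, no sync after each write.
    conn.execute("PRAGMA journal_mode = OFF")
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, content BLOB)")

    # Compress each document with bzip2, then insert all rows in one bulk call.
    rows = [(doc_id, bz2.compress(text.encode("utf-8"))) for doc_id, text in documents]
    conn.executemany("INSERT INTO documents (id, content) VALUES (?, ?)", rows)
    conn.commit()

    if path is not None:
        # Persist the in-memory database to a file once building is finished.
        disk = sqlite3.connect(path)
        conn.backup(disk)
        disk.close()
    return conn


def load_content(conn, doc_id):
    """Fetch one document and decompress it back to text."""
    row = conn.execute(
        "SELECT content FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return bz2.decompress(row[0]).decode("utf-8")


conn = build_database([(1, "This is a book. " * 50), (2, "これは本です")])
print(load_content(conn, 2))
```

Turning off the journal and synchronous writes means a crash during indexing can corrupt the database file, so these settings suit an index that can be rebuilt from the source documents.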
12. Performance Test
Wikipedia Japanese / 10,265 articles / 130MB
MySQL LIKE ‘%%’
Project started on May 23, 1995
Number of contributor(s) : 57 including Oracle and Google
Number of search word(s) : 1, 2, 3
Execution time (sec) : 2.71, 2.25, 2.02
Groonga (Full text search engine)
Project started on Jan 11, 2009
Number of contributor(s) : 30
Number of search word(s) : 1, 2, 3
Execution time (sec) : 0.013, 0.016, 0.059
Falcon
Project started on Mar 8, 2015
Number of contributor(s) : 1
Number of search word(s) : 1, 2, 3
Execution time (sec) : 0.137, 0.132, 0.170
13. Points to be improved
Pursue scalability and higher performance
Implement a normalizer
Search results should be sorted by relevance between the search words and the contents
Develop an application using Falcon
Highlight
Snippet
Keyword suggestion
Possibility suggestion
Error correction
Pagination