Falcon
Full Text Search Engine
Mar 28, 2015
Adamson University
Master in Information Technology
Advanced Object Oriented Programming
Hideshi Ogoshi
What is Falcon
 The name represents speed and strength
 Lightweight full-text search engine
 Command-line application
 Provides an HTTP server mode
 Written in the Python programming language
 Only 1 file and 421 lines of code
 Data is stored in an SQLite3 database
 https://github.com/hideshi/Falcon
What is a full-text search engine
 A store for text documents
 Much faster than an SQL query that uses a LIKE ‘%%’ partial-match expression
 Composed of an index manager, an index builder and a search function
 Has its own data structure called an ‘inverted index’
 Each word is split into tokens by a ‘tokenizer’
What is a tokenizer
 Splits text, in which words are separated by spaces, into several tokens
 A token is a group of characters
 This is a book -> ‘This’, ‘is’, ‘a’, ‘book’
 Useful for the many languages that separate words by spaces, such as English, French and Tagalog
 Applying it to languages such as Japanese or Chinese causes problems, because these languages don’t use spaces in their sentences
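The space-based splitting described above can be sketched in a few lines of Python. This is only an illustration, not Falcon's actual code; the function name `tokenize` is a hypothetical choice.

```python
def tokenize(text):
    """Split text into tokens at runs of whitespace (hypothetical helper)."""
    return text.split()

print(tokenize("This is a book"))  # ['This', 'is', 'a', 'book']
print(tokenize("これは本です"))      # ['これは本です']
```

The Japanese sentence comes back as a single token because it contains no spaces, which is the problem the slide points out.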
What is an n-gram tokenizer
 A kind of tokenizer that splits words or sentences into several tokens
 Each token has a fixed number of characters
 The number of characters depends on the type of n-gram tokenizer
 unigram, bigram, trigram, etc.
What is a bigram
 How a bigram tokenizer splits a sentence into tokens
 Each token has two characters
 English (spaces removed)
 This is a book -> ‘Th’, ‘hi’, ‘is’, ‘si’, ‘is’, ‘sa’, ‘ab’, ‘bo’, ‘oo’, ‘ok’
 Japanese
 これは本です -> ‘これ’, ‘れは’, ‘は本’, ‘本で’, ‘です’
 Chinese
 这是书 -> ‘这是’, ‘是书’
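A character n-gram tokenizer can be sketched as a sliding window over the text. The function below is a minimal illustration and not taken from Falcon's source; the name `ngram` and the choice to drop spaces before splitting are assumptions made to match the English example. Note that a strict sliding window over ‘Thisisabook’ also yields ‘si’ and a second ‘is’.

```python
def ngram(text, n=2):
    """Return the character n-grams of text (hypothetical helper).

    Whitespace is removed first, which is an assumption based on the
    English example, where the tokens run across word boundaries.
    """
    chars = "".join(text.split())
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

print(ngram("This is a book"))
# ['Th', 'hi', 'is', 'si', 'is', 'sa', 'ab', 'bo', 'oo', 'ok']
print(ngram("これは本です"))  # ['これ', 'れは', 'は本', '本で', 'です']
print(ngram("这是书"))        # ['这是', '是书']
```

Passing `n=1` or `n=3` gives the unigram and trigram variants with the same code.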
What is an inverted index
 A data structure that provides a faster way to retrieve data
 Example text: This is a book. That is a pen.

Dictionary   Posting List
This         0
is           1, 5
a            2, 6
book         3
That         4
pen          7
What is an inverted index
“government of the people, by the people, for the people,
shall not perish from the earth.”
{“by”, 1, {1: [4]}}, {“earth”, 1, {1: [15]}}, {“for”, 1, {1: [7]}},
{“from”, 1, {1: [13]}}, {“government”, 1, {1: [0]}},
{“not”, 1, {1: [11]}}, {“of”, 1, {1: [1]}},
{“people”, 3, {1: [3, 6, 9]}}, {“perish”, 1, {1: [12]}},
{“shall”, 1, {1: [10]}}, {“the”, 4, {1: [2, 5, 8, 14]}}
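The entries above appear to follow the pattern {token, number of occurrences, {document ID: [positions]}}. Under that reading, a positional inverted index can be built as in the following sketch. The function name `build_index`, the word-level regular expression, and the tuple layout are illustrative assumptions, not Falcon's actual implementation.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each token to (total count, {doc_id: [positions]}).

    `documents` is a dict of {doc_id: text}. Tokens are whole words and
    punctuation is dropped; both are assumptions for this sketch.
    """
    postings = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, token in enumerate(re.findall(r"\w+", text)):
            postings[token][doc_id].append(position)
    return {
        token: (sum(len(p) for p in docs.values()), dict(docs))
        for token, docs in sorted(postings.items())
    }

docs = {1: "government of the people, by the people, for the people, "
           "shall not perish from the earth."}
index = build_index(docs)
print(index["people"])  # (3, {1: [3, 6, 9]})
print(index["the"])     # (4, {1: [2, 5, 8, 14]})
```

Looking up a search word is then a single dictionary access instead of a scan over every document.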
Table Definition

INDICES
  TOKEN         TEXT PRIMARY KEY
  POSTING_LIST  BLOB

DOCUMENTS
  ID       INTEGER PRIMARY KEY
  TITLE    TEXT
  CONTENT  BLOB
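As a rough sketch, the two tables could be created with Python's standard `sqlite3` module as follows. The table and column names come from the slide; the helper name `create_tables` and the use of `IF NOT EXISTS` are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS INDICES (
    TOKEN        TEXT PRIMARY KEY,
    POSTING_LIST BLOB
);
CREATE TABLE IF NOT EXISTS DOCUMENTS (
    ID      INTEGER PRIMARY KEY,
    TITLE   TEXT,
    CONTENT BLOB
);
"""

def create_tables(conn):
    """Create the INDICES and DOCUMENTS tables (hypothetical helper)."""
    conn.executescript(SCHEMA)

conn = sqlite3.connect(":memory:")
create_tables(conn)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['DOCUMENTS', 'INDICES']
```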
Class Diagram
Performance Tuning
 Tokens containing stop words, which consist of symbols such as !”#$%&’()-=^~¥|@`[{;+:*]},<.>/?_, are ignored by the tokenizer to reduce the time needed for index creation and searching.
 Document contents are compressed with the bzip2 algorithm to reduce query time. The compression ratio (compressed size relative to the original) is 38.6% in the best case and 79.3% on average.
 journal_mode and synchronous are turned off so that no unnecessary files are created when records are inserted. This improves speed by 8%.
 Bulk inserts are used instead of executing an INSERT statement for each record. This improves speed by 11%.
 Falcon provides an in-memory database mode powered by SQLite3. While the index is being created, Falcon writes new records to memory to reduce I/O time; once the index is complete, the in-memory database is saved to a file. This improves speed by 17%.
 Memory usage of the inverted index objects is checked constantly. When it exceeds the limit, the data is written to the database and removed from memory. This improves speed by 380%.
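Several of these techniques (bzip2 compression, turning off journal_mode and synchronous, bulk insert, and building in memory before saving to a file) can be combined as in the sketch below. It uses only Python's standard `bz2` and `sqlite3` modules; the function name `store_documents` and its arguments are hypothetical, and the memory-limit check from the last bullet is not covered. This is an illustration under those assumptions, not Falcon's actual code.

```python
import bz2
import sqlite3

def store_documents(path, documents):
    """Bulk-insert compressed documents via an in-memory database.

    `documents` is a list of (title, content) string pairs. Returns an
    open connection to the database at `path`. (Hypothetical helper.)
    """
    mem = sqlite3.connect(":memory:")
    # Skip journaling and fsync-style syncing while inserting.
    mem.execute("PRAGMA journal_mode = OFF")
    mem.execute("PRAGMA synchronous = OFF")
    mem.execute("CREATE TABLE DOCUMENTS ("
                "ID INTEGER PRIMARY KEY, TITLE TEXT, CONTENT BLOB)")
    # Compress each document body with bzip2 before storing it.
    rows = [(title, bz2.compress(content.encode("utf-8")))
            for title, content in documents]
    # One executemany call instead of one INSERT per record.
    mem.executemany(
        "INSERT INTO DOCUMENTS (TITLE, CONTENT) VALUES (?, ?)", rows)
    mem.commit()
    # Copy the finished in-memory database to its destination.
    disk = sqlite3.connect(path)
    mem.backup(disk)
    mem.close()
    return disk
```

A call such as `store_documents("falcon.db", [("title", "text")])` would write the database file in one step at the end; stored contents are read back with `bz2.decompress(blob).decode("utf-8")`. `Connection.backup` requires Python 3.7 or later.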
Performance Test
 Test data: Wikipedia Japanese / 10,265 articles / 130MB
 MySQL LIKE ‘%%’
   Project started on May 23, 1995
   Number of contributors: 57, including Oracle and Google
   Number of search words: 1, 2, 3
   Execution time (sec): 2.71, 2.25, 2.02
 Groonga (full-text search engine)
   Project started on Jan 11, 2009
   Number of contributors: 30
   Number of search words: 1, 2, 3
   Execution time (sec): 0.013, 0.016, 0.059
 Falcon
   Project started on Mar 8, 2015
   Number of contributors: 1
   Number of search words: 1, 2, 3
   Execution time (sec): 0.137, 0.132, 0.170
Points to be improved
 Pursue scalability and higher performance
 Implement normalizer
 Search results should be sorted by relevance between the search words and the contents
 Develop an application using Falcon
 Highlight
 Snippet
 Keyword suggestion
 Possibility suggestion
 Error correction
 Pagination
Thank you
