Falcon
Full Text Search Engine
Mar 28, 2015
Adamson University
Master in Information Technology
Advanced Object Oriented Programming
Hideshi Ogoshi
What is Falcon
 The name represents speed and strength
 Lightweight full text search engine
 Command line application
 Provides an HTTP server mode
 Written in Python programming language
 Only 1 file and 421 lines of code
 Data is stored in SQLite3 database
 https://github.com/hideshi/Falcon
What is a full text search engine
 A storage for text documents
 Much faster than an SQL query that uses a LIKE ‘%%’ partial match expression
 Composed of an index manager, an index builder and a search function
 Has its own data structure called an ‘inverted index’
 Each word is split into tokens by a ‘tokenizer’
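For comparison, the LIKE baseline mentioned above can be sketched with Python's built-in sqlite3 module. The table name `docs`, the sample rows and the helper `like_search` are hypothetical and only serve the illustration. A pattern with a leading wildcard cannot use an ordinary index, so the database has to scan every row, which is the cost an inverted index avoids.

```python
import sqlite3

# Hypothetical sample data; not taken from Falcon.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO docs (content) VALUES (?)",
    [("This is a book.",), ("That is a pen.",), ("A falcon is fast.",)],
)


def like_search(connection, word):
    """Return the ids of documents whose content contains `word`.

    The leading '%' forces a scan of every row in the table.
    """
    pattern = "%" + word + "%"
    rows = connection.execute(
        "SELECT id FROM docs WHERE content LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


print(like_search(conn, "pen"))  # [2]
```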
What is a tokenizer
 Splits text, whose words are separated by spaces, into several tokens
 A token is a group of characters
 This is a book -> ‘This’, ‘is’, ‘a’, ‘book’
 It is useful for the many languages that separate words by spaces, such as English, French and Tagalog.
 Applying it to languages such as Japanese or Chinese causes problems because these languages do not use spaces in their sentences.
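A minimal sketch of a space-based tokenizer shows both the English case and the problem with Japanese. The function name `whitespace_tokenize` is an assumption made for this example and is not Falcon's API.

```python
def whitespace_tokenize(text):
    """Split text into tokens at runs of whitespace."""
    return text.split()


print(whitespace_tokenize("This is a book"))  # ['This', 'is', 'a', 'book']

# A Japanese sentence contains no spaces, so it comes back as one
# single token and individual words cannot be looked up.
print(whitespace_tokenize("これは本です"))  # ['これは本です']
```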
What is an ngram tokenizer
 A kind of tokenizer that splits words or sentences into several tokens
 Each token has a fixed number of characters
 The number of characters depends on the type of ngram tokenizer
 unigram (1 character), bigram (2 characters), trigram (3 characters), etc.
What is a bigram
 How a bigram tokenizer splits a sentence into tokens
 Each token has two characters
 English
 This is a book -> ‘Th’, ‘hi’, ‘is’, ‘si’, ‘is’, ‘sa’, ‘ab’, ‘bo’, ‘oo’, ‘ok’ (spaces removed)
 Japanese
 これは本です -> ‘これ’, ‘れは’, ‘は本’, ‘本で’, ‘です’
 Chinese
 这是书 -> ‘这是’, ‘是书’
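The sliding window behind an ngram tokenizer can be sketched as follows. The function name `ngram_tokenize` and the choice to drop whitespace before windowing are assumptions for this illustration. Falcon's own tokenizer may differ in details such as stop-word handling.

```python
def ngram_tokenize(text, n=2):
    """Return every run of n consecutive characters in text.

    Whitespace is removed first (an assumption of this sketch).
    Text shorter than n characters yields no tokens.
    """
    chars = "".join(text.split())
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


print(ngram_tokenize("これは本です"))  # ['これ', 'れは', 'は本', '本で', 'です']
print(ngram_tokenize("这是书"))        # ['这是', '是书']
print(ngram_tokenize("这是书", n=1))   # ['这', '是', '书']
```

Because the window moves one character at a time, the tokenizer needs no knowledge of word boundaries, which is why it also works for languages written without spaces.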
What is an inverted index
 A data structure that provides a faster way to retrieve data
Dictionary    Posting List
This          0
is            1, 5
a             2, 6
book          3
That          4
pen           7
This is a book. That is a pen.
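The dictionary and posting lists for this sentence pair can be reproduced with a short sketch. It uses word tokens rather than bigrams to match the table, and the name `build_posting_lists` is hypothetical.

```python
import re
from collections import defaultdict


def build_posting_lists(text):
    """Map each word to the list of positions where it occurs."""
    words = re.findall(r"\w+", text)  # punctuation is dropped
    index = defaultdict(list)
    for position, word in enumerate(words):
        index[word].append(position)
    return dict(index)


index = build_posting_lists("This is a book. That is a pen.")
print(index["is"])   # [1, 5]
print(index["pen"])  # [7]
```

Looking up a word is then a single dictionary access instead of a scan over the whole text.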
What is an inverted index
“government of the people, by the people, for the people,
shall not perish from the earth.”
{“by”, 1, {1: [4]}}, {“earth”, 1, {1: [15]}}, {“for”, 1, {1: [7]}},
{“from”, 1, {1: [13]}}, {“government”, 1, {1: [0]}},
{“not”, 1, {1: [11]}}, {“of”, 1, {1: [1]}},
{“people”, 3, {1: [3, 6, 9]}}, {“perish”, 1, {1: [12]}},
{“shall”, 1, {1: [10]}}, {“the”, 4, {1: [2, 5, 8, 14]}}
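Each entry above appears to hold the token, its number of occurrences, and a map from document ID to the positions of the token in that document. Under that reading, a minimal sketch of building and querying such an index looks like this. The names `add_document`, `entry` and `search`, and the lowercasing step, are assumptions for illustration rather than Falcon's actual code.

```python
import re
from collections import defaultdict


def add_document(index, doc_id, text):
    """Record the position of every word of `text` under `doc_id`."""
    for position, token in enumerate(re.findall(r"\w+", text.lower())):
        index[token][doc_id].append(position)


def entry(index, token):
    """Return (token, occurrence count, {doc_id: [positions]})."""
    postings = {doc_id: list(pos) for doc_id, pos in index.get(token, {}).items()}
    count = sum(len(pos) for pos in postings.values())
    return (token, count, postings)


def search(index, query):
    """Return the sorted IDs of documents that contain every query word."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return []
    doc_sets = [set(index.get(token, {})) for token in tokens]
    return sorted(set.intersection(*doc_sets))


index = defaultdict(lambda: defaultdict(list))
add_document(
    index,
    1,
    "government of the people, by the people, for the people, "
    "shall not perish from the earth.",
)
print(entry(index, "people"))  # ('people', 3, {1: [3, 6, 9]})
print(search(index, "people earth"))  # [1]
```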
Table Definition
INDICES
    TOKEN         TEXT     PRIMARY KEY
    POSTING_LIST  BLOB
DOCUMENTS
    ID            INTEGER  PRIMARY KEY
    TITLE         TEXT
    CONTENT       BLOB
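The two tables can be created with Python's sqlite3 module roughly as follows. The column names and types come from the definition above. The `IF NOT EXISTS` clauses and the helper name `create_tables` are additions for this sketch.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS INDICES (
    TOKEN        TEXT PRIMARY KEY,
    POSTING_LIST BLOB
);
CREATE TABLE IF NOT EXISTS DOCUMENTS (
    ID      INTEGER PRIMARY KEY,
    TITLE   TEXT,
    CONTENT BLOB
);
"""


def create_tables(connection):
    """Create the INDICES and DOCUMENTS tables if they are missing."""
    connection.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_tables(conn)
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)  # ['DOCUMENTS', 'INDICES']
```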
Class Diagram
Performance Tuning
 Tokens that contain symbols such as !”#$%&’()-=^~¥|@`[{;+:*]},<.>/?_ are treated as stop words and ignored by the tokenizer, which reduces the time for creating the index and for searching.
 Document contents are compressed with the bzip2 algorithm to reduce query time. The compression rate is 38.6% at best and 79.3% on average.
 journal_mode and synchronous are turned off so that no unnecessary files are created when records are inserted. This improves speed by 8%.
 Bulk insert is used instead of executing an insert statement for each record. This improves speed by 11%.
 Falcon provides an in-memory database mode powered by SQLite3. While creating the index, Falcon creates new records in memory to reduce I/O time. After the index is created, the in-memory database is stored in a file. This improves speed by 17%.
 Memory usage of the inverted index objects is checked constantly. When it exceeds the limit, the data is stored in the database and deleted from memory. This improves speed by 380%.
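Several of these techniques can be sketched together with Python's standard library: bzip2 compression, the two PRAGMA settings, bulk insert, and copying an in-memory database to a file. All function names here are hypothetical, and `Connection.backup` requires Python 3.7 or later, so Falcon itself may well use a different mechanism.

```python
import bz2
import sqlite3


def open_fast_connection(path=":memory:"):
    """Open a database with journaling and fsync waits switched off."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode = OFF")
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS DOCUMENTS "
        "(ID INTEGER PRIMARY KEY, TITLE TEXT, CONTENT BLOB)"
    )
    return conn


def bulk_insert(conn, documents):
    """Insert (title, content) pairs in one executemany call.

    Contents are bzip2-compressed before they are stored.
    """
    rows = [
        (title, bz2.compress(content.encode("utf-8")))
        for title, content in documents
    ]
    conn.executemany("INSERT INTO DOCUMENTS (TITLE, CONTENT) VALUES (?, ?)", rows)
    conn.commit()


def load_content(conn, doc_id):
    """Fetch and decompress the content of one document."""
    row = conn.execute(
        "SELECT CONTENT FROM DOCUMENTS WHERE ID = ?", (doc_id,)
    ).fetchone()
    return bz2.decompress(row[0]).decode("utf-8")


def save_to_file(memory_conn, path):
    """Copy an in-memory database into a file on disk."""
    file_conn = sqlite3.connect(path)
    memory_conn.backup(file_conn)
    file_conn.close()
```

A typical flow would call `open_fast_connection()` to build the index in memory, `bulk_insert` for each batch of documents, and `save_to_file` once at the end. Turning off the journal and synchronous writes trades crash safety for speed, which is acceptable when the index can be rebuilt from the source documents.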
Performance Test
 Data set: Japanese Wikipedia / 10,265 articles / 130MB
 MySQL LIKE ‘%%’
     Project started on May 23, 1995
     Number of contributors: 57, including Oracle and Google
     Number of search words: 1, 2, 3
     Execution time (sec): 2.71, 2.25, 2.02
 Groonga (full text search engine)
     Project started on Jan 11, 2009
     Number of contributors: 30
     Number of search words: 1, 2, 3
     Execution time (sec): 0.013, 0.016, 0.059
 Falcon
     Project started on Mar 8, 2015
     Number of contributors: 1
     Number of search words: 1, 2, 3
     Execution time (sec): 0.137, 0.132, 0.170
Points to be improved
 Pursue scalability and higher performance
 Implement a normalizer
 Search results should be sorted by the relevance of the contents to the search words
 Develop an application using Falcon
 Highlight
 Snippet
 Keyword suggestion
 Possibility suggestion
 Error correction
 Pagination
Thank you
