PostgreSQL FTS
PGBR 2013
by Emanuel Calvo
Saturday, August 17, 13
Palomino - Service Offerings
• Monthly Support:
o Being renamed to Palomino DBA as a service.
o Eliminating 10 hour monthly clients.
o Discounts are based on spend per month (0-80, 81-160, 161+)
o We will be penalizing excessive paging financially.
o Quarterly onsite day from Palomino executive, DBA and PM for clients
using 80 hours or more per month.
o Clients using 80-160 hours get 2 New Relic licenses. 160 hours plus
get 4.
• Adding annual support contracts:
o Consultation as needed.
o Emergency pages allowed.
o Small bucket of DBA hours (8, 16 or 24)
For more information, please go to: Spreadsheet
“Advanced Technology Partner” for Amazon Web Services
About me:
• Operational DBA at PalominoDB.
• MySQL, Maria and PostgreSQL databases (and others)
• Community member
• Check out my LinkedIn Profile at: http://es.linkedin.com/in/ecbcbcb/
Credits
• Thanks to:
o Andrew Atanasoff
o Vlad Fedorkov
o All the PalominoDB people that help out!
Agenda
• What are we looking for?
• Concepts
• Native Postgres Support
o http://www.postgresql.org/docs/9.2/static/textsearch.html
• External solutions
o Sphinx
 http://sphinxsearch.com/
o Solr
 http://lucene.apache.org/solr/
What are we looking for?
• Finish the talk knowing what FTS is for.
• Have an idea of which tools are available.
• Know which of those tools is the best fit.
Goals of FTS
• Add complex searches using synonyms, specific operators or spellings.
o Improves performance at the cost of some accuracy.
• Reduce IO and CPU utilization.
o Text consumes a lot of IO for read and CPU for operations.
• FTS can be handled:
o Externally
 using tools like Sphinx or Solr
 Ideal for massive text search, simple queries
o Internally
 native FTS support.
 Ideal for complex queries combining with other business rules
• Order words by relevance
• Language sensitive
• Faster than regular expressions or the LIKE operator
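A minimal sketch of the difference between substring matching and full text search. The sample sentence is an illustrative assumption, not taken from the talk, and the queries rely only on the built-in english configuration. The first query should return false and the second true.

```sql
-- Substring match: the literal text "fox jump" never appears in the string.
SELECT 'The quick brown foxes jumped' LIKE '%fox jump%' AS like_match;

-- Full text match: "foxes" and "jumped" are stemmed to the lexemes
-- 'fox' and 'jump', so the document matches the query.
SELECT to_tsvector('english', 'The quick brown foxes jumped')
       @@ to_tsquery('english', 'fox & jump') AS fts_match;
```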
The future of native FTS
• https://wiki.postgresql.org/wiki/PGCon2013_Unconference_Future_of_Full-Text_Search
• http://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf
• 150 Kb patch for 9.3
• GIN/GIST interface improved
WHY FTS??
You may be asking this now.
... and the answer is this reaction when I see a "LIKE '%pattern%'"
Concepts
• Parsers
o 23 token types (url, email, file, etc)
• Token
• Stop word
• Lexeme
o array of lexemes + position + weight = tsvector
• Dictionaries
o Simple Dictionary
 The simple dictionary template operates by converting the input
token to lower case and checking it against a file of stop words.
o Synonym Dictionary
o Thesaurus Dictionary
o Ispell Dictionary
o Snowball Dictionary
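The concepts above can be explored from psql. This is a small sketch that uses only built-in objects (the default parser and the english_stem dictionary); the input words are illustrative.

```sql
-- Token types known to the default parser (url, email, file, ...).
SELECT alias, description FROM ts_token_type('default');

-- Snowball dictionary: a token is reduced to its stem (lexeme).
SELECT ts_lexize('english_stem', 'stars');   -- {star}

-- A stop word is recognized by the dictionary and yields an empty array.
SELECT ts_lexize('english_stem', 'a');       -- {}
```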
Limitations
• The length of each lexeme must be less than 2K bytes
• The length of a tsvector (lexemes + positions) must be less than 1
megabyte
• The number of lexemes must be less than 2^64
• Position values in tsvector must be greater than 0 and no more than
16,383
• No more than 256 positions per lexeme
• The number of nodes (lexemes + operators) in a tsquery must be less than
32,768
• These limits are hard to reach!
For comparison, the PostgreSQL 8.1 documentation contained 10,441 unique words, a total of
335,420 words, and the most frequent word “postgresql” was mentioned 6,127 times in 655
documents.
Another example — the PostgreSQL mailing list archives contained 910,989 unique words with
57,491,343 lexemes in 461,020 messages.
Elements
\dF[+] [PATTERN] list text search configurations
\dFd[+] [PATTERN] list text search dictionaries
\dFp[+] [PATTERN] list text search parsers
\dFt[+] [PATTERN] list text search templates
full_text_search=# \dFd+ *
List of text search dictionaries
Schema | Name | Template | Init options | Description
------------+-----------------+---------------------+---------------------------------------------------+-----------------------------------------------------------
pg_catalog | danish_stem | pg_catalog.snowball | language = 'danish', stopwords = 'danish' | snowball stemmer for danish language
pg_catalog | dutch_stem | pg_catalog.snowball | language = 'dutch', stopwords = 'dutch' | snowball stemmer for dutch language
pg_catalog | english_stem | pg_catalog.snowball | language = 'english', stopwords = 'english' | snowball stemmer for english language
pg_catalog | finnish_stem | pg_catalog.snowball | language = 'finnish', stopwords = 'finnish' | snowball stemmer for finnish language
pg_catalog | french_stem | pg_catalog.snowball | language = 'french', stopwords = 'french' | snowball stemmer for french language
pg_catalog | german_stem | pg_catalog.snowball | language = 'german', stopwords = 'german' | snowball stemmer for german language
pg_catalog | hungarian_stem | pg_catalog.snowball | language = 'hungarian', stopwords = 'hungarian' | snowball stemmer for hungarian language
pg_catalog | italian_stem | pg_catalog.snowball | language = 'italian', stopwords = 'italian' | snowball stemmer for italian language
pg_catalog | norwegian_stem | pg_catalog.snowball | language = 'norwegian', stopwords = 'norwegian' | snowball stemmer for norwegian language
pg_catalog | portuguese_stem | pg_catalog.snowball | language = 'portuguese', stopwords = 'portuguese' | snowball stemmer for portuguese language
pg_catalog | romanian_stem | pg_catalog.snowball | language = 'romanian' | snowball stemmer for romanian language
pg_catalog | russian_stem | pg_catalog.snowball | language = 'russian', stopwords = 'russian' | snowball stemmer for russian language
pg_catalog | simple | pg_catalog.simple | | simple dictionary: just lower case and check for stopword
pg_catalog | spanish_stem | pg_catalog.snowball | language = 'spanish', stopwords = 'spanish' | snowball stemmer for spanish language
pg_catalog | swedish_stem | pg_catalog.snowball | language = 'swedish', stopwords = 'swedish' | snowball stemmer for swedish language
pg_catalog | turkish_stem | pg_catalog.snowball | language = 'turkish', stopwords = 'turkish' | snowball stemmer for turkish language
(16 rows)
Elements
postgres=# \dF
List of text search configurations
Schema | Name | Description
------------+------------+---------------------------------------
pg_catalog | danish | configuration for danish language
pg_catalog | dutch | configuration for dutch language
pg_catalog | english | configuration for english language
pg_catalog | finnish | configuration for finnish language
pg_catalog | french | configuration for french language
pg_catalog | german | configuration for german language
pg_catalog | hungarian | configuration for hungarian language
pg_catalog | italian | configuration for italian language
pg_catalog | norwegian | configuration for norwegian language
pg_catalog | portuguese | configuration for portuguese language
pg_catalog | romanian | configuration for romanian language
pg_catalog | russian | configuration for russian language
pg_catalog | simple | simple configuration
pg_catalog | spanish | configuration for spanish language
pg_catalog | swedish | configuration for swedish language
pg_catalog | turkish | configuration for turkish language
(16 rows)
Elements
List of data types
Schema | Name | Description
------------+-----------+---------------------------------------------------------
pg_catalog | gtsvector | GiST index internal text representation for text search
pg_catalog | tsquery | query representation for text search
pg_catalog | tsvector | text representation for text search
(3 rows)
Some operators:
• @@ matches a tsvector against a tsquery
• || concatenates tsvectors (positions in the right-hand vector are offset by the largest position in the left-hand one; weights are retained)
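A short sketch of both operators, using literal values in the style of the PostgreSQL documentation rather than data from the talk:

```sql
-- @@ : true when the document (tsvector) satisfies the query (tsquery).
SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', 'fat & rat') AS matches;

-- || : concatenation; positions from the right-hand vector are shifted
-- past the largest position of the left-hand vector.
SELECT 'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector;
-- 'a':1 'b':2,5 'c':3 'd':4
```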
Small Example
full_text_search=# create table basic_example (i serial PRIMARY KEY, whole text, fulled tsvector, dictionary regconfig);
postgres=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON basic_example FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(fulled, 'pg_catalog.english', whole);
CREATE TRIGGER
postgres=# insert into basic_example(whole,dictionary) values ('This is an example','english'::regconfig);
INSERT 0 1
full_text_search=# create index on basic_example using GIN(to_tsvector(dictionary,whole));
CREATE INDEX
full_text_search=# create index on basic_example using GIST(to_tsvector(dictionary,whole));
CREATE INDEX
postgres=# select * from basic_example;
i | whole | fulled | dictionary
---+--------------------+------------+------------
5 | This is an example | 'exampl':4 | english
(1 row)
Pre processing
• Documents into tokens
 Find and clean
• Tokens into lexemes
o Token normalised to a language or dictionary
o Eliminate stop words (high-frequency words)
• Storing
o Array of lexemes (tsvector)
 Word positions take stop words into account, although the stop words themselves are not stored
 Positional information is stored so that proximity can be used
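The pipeline above can be observed with a single call. The sentence is the one used in the PostgreSQL documentation, not data from the talk. Note that 'fat' lands at position 2 because the stop word 'a' still occupies position 1 even though it is not stored.

```sql
-- Tokens are parsed, normalised to lexemes, stop words (a, on, it) are
-- dropped, and the remaining lexemes keep their original word positions.
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
-- 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
```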
Highlighting
• ts_headline
o it doesn't use tsvector and needs to use the entire document, so could be
expensive.
• Use it only for certain types of queries or titles
postgres=# SELECT ts_headline('english','Just a simple example of a highlighted query and similarity.',to_tsquery('query & similarity'),'StartSel = <, StopSel = >');
ts_headline
------------------------------------------------------------------
Just a simple example of a highlighted <query> and <similarity>.
(1 row)
Default:
StartSel=<b>, StopSel=</b>,
MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE,
MaxFragments=0, FragmentDelimiter=" ... "
Ranking
• Weights: (A B C D)
• Ranking functions:
o ts_rank
o ts_rank_cd
• Ranking is expensive because it has to reprocess and check the tsvector of every matching document.
SELECT to_tsquery('english', 'Fat | Rats:AB');
to_tsquery
------------------
'fat' | 'rat':AB
Also, * can be attached to a lexeme to specify prefix matching:
SELECT to_tsquery('supern:*A & star:A*B');
to_tsquery
--------------------------
'supern':*A & 'star':*AB
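A hedged sketch of how weights feed into ranking. The two text fragments are made up for illustration; only the built-in functions setweight, to_tsvector, to_tsquery and ts_rank are used. With the default weight array, a match labelled A counts more than an unlabelled (D) match, so the first value would be expected to be the larger one.

```sql
SELECT
  -- "Title" labelled A, "body" labelled D, then concatenated.
  ts_rank(
    setweight(to_tsvector('english', 'Fat rats'), 'A') ||
    setweight(to_tsvector('english', 'The cat ate the rats in the barn'), 'D'),
    to_tsquery('english', 'rat')
  ) AS rank_with_title,
  -- Same "body" alone, with the default weight D.
  ts_rank(
    to_tsvector('english', 'The cat ate the rats in the barn'),
    to_tsquery('english', 'rat')
  ) AS rank_body_only;
```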
Manipulating tsvectors and tsqueries
• Manipulating tsvectors
o setweight(vector tsvector, weight "char") returns tsvector
o length(tsvector): number of lexemes
o strip(tsvector): returns the tsvector without position or weight information
• Manipulating Queries
• If the query is built from dynamic input, check it with numnode(tsquery); this avoids unnecessary searches when the input contains nothing but stop words
o numnode(plainto_tsquery('a the is'))
o cleaning up queries with querytree is also useful
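The functions listed above can be tried on literals. The values follow the examples in the PostgreSQL documentation and are not taken from the talk.

```sql
-- Only stop words: the query is empty, numnode returns 0 (with a NOTICE),
-- so the application can skip the search altogether.
SELECT numnode(plainto_tsquery('english', 'a the is'));   -- 0

-- Two lexemes plus one operator.
SELECT numnode('foo & bar'::tsquery);                     -- 3

-- Number of lexemes in a tsvector.
SELECT length('fat:2,4 cat:3 rat:5A'::tsvector);          -- 3

-- Drop positions and weights.
SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);           -- 'cat' 'fat' 'rat'
```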
Example
postgres=# select * from ts_debug('english','The doctor saids I''m sick.');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+--------+----------------+--------------+----------
asciiword | Word, all ASCII | The | {english_stem} | english_stem | {}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | doctor | {english_stem} | english_stem | {doctor}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | saids | {english_stem} | english_stem | {said}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | I | {english_stem} | english_stem | {}
blank | Space symbols | ' | {} | |
asciiword | Word, all ASCII | m | {english_stem} | english_stem | {m}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | sick | {english_stem} | english_stem | {sick}
blank | Space symbols | . | {} | |
(12 rows)
postgres=# select numnode(plainto_tsquery('The doctor saids I''m sick.')), plainto_tsquery('The doctor saids I''m sick.'),
to_tsvector('english','The doctor saids I''m sick.'), ts_lexize('english_stem','The doctor saids I''m sick.');
numnode | plainto_tsquery | to_tsvector | ts_lexize
---------+----------------------------------+------------------------------------+--------------------------------
7 | 'doctor' & 'said' & 'm' & 'sick' | 'doctor':2 'm':5 'said':3 'sick':6 | {"the doctor saids i'm sick."}
(1 row)
Manipulating tsquery
postgres=# SELECT querytree(to_tsquery('!defined'));
querytree
-----------
T
(1 row)
postgres=# SELECT querytree(to_tsquery('cat & food | (dog & run & food)'));
querytree
-----------------------------------------
'cat' & 'food' | 'dog' & 'run' & 'food'
(1 row)
postgres=# SELECT querytree(to_tsquery('the '));
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
querytree
-----------
(1 row)
Automating updates on tsvector
• PostgreSQL provides standard trigger functions for this:
o tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ])
o tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... ])
CREATE TABLE messages (
title text,
body text,
tsv tsvector
);
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
Automating updates on tsvector (2)
If you want to keep a custom weight:
CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
begin
new.tsv :=
setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
return new;
end
$$ LANGUAGE plpgsql;
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE messages_trigger();
Tips and considerations
• Store the text externally, index on the database
o requires superuser
• Store the whole document on the database, index on Sphinx/Solr
• Don't index everything
o Solr/Sphinx are not databases; index only what you want to search. Smaller indexes are faster and easier to maintain.
• ts_stat
o can help you check your FTS configuration
• You can parse URLs, email addresses and other tokens with the ts_debug function for non-intensive operations
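As a sketch of the ts_stat tip, the query below lists the most frequent lexemes in a tsvector column. It assumes the basic_example table and its fulled column from the earlier "Small Example" slide. Very frequent words near the top are candidates for the stop word list.

```sql
-- word: the lexeme, ndoc: documents containing it, nentry: total occurrences.
SELECT word, ndoc, nentry
FROM ts_stat('SELECT fulled FROM basic_example')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
```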
Tips and considerations
• You can index by language
CREATE INDEX pgweb_idx_en ON pgweb USING gin(to_tsvector('english', body)) WHERE config_language = 'english';
CREATE INDEX pgweb_idx_fr ON pgweb USING gin(to_tsvector('french', body)) WHERE config_language = 'french';
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_language, body));
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
• Partitioning the table by language is also good practice
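A sketch of a query that could use the English partial index. It assumes the pgweb table (with title, body and config_language columns) and the pgweb_idx_en index from this slide; the search term is illustrative. An expression index is only considered when the query repeats the indexed expression, so the two-argument form of to_tsvector with the same configuration name is used here.

```sql
SELECT title
FROM pgweb
WHERE config_language = 'english'                 -- matches the partial index predicate
  AND to_tsvector('english', body) @@ to_tsquery('english', 'friend');
```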
Features on 9.2
• Move tsvector most-common-element statistics to new pg_stats columns
(Alexander Korotkov)
• Consult most_common_elems and most_common_elem_freqs for the data
formerly available in most_common_vals and most_common_freqs for a tsvector
column.
most_common_elems | {exampl}
most_common_elem_freqs | {1,1,1}
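A sketch of how these columns can be inspected on 9.2 or later. It assumes the basic_example table and its fulled column from the earlier "Small Example" slide.

```sql
-- Refresh the statistics first, then read the element-level columns.
ANALYZE basic_example;

SELECT most_common_elems, most_common_elem_freqs
FROM pg_stats
WHERE tablename = 'basic_example'
  AND attname = 'fulled';
```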
Links
• http://www.postgresql.org/docs/9.2/static/textsearch.html
• http://www.postgresql.org/docs/9.2/static/textsearch-migration.html (migration from pre-8.3 versions)
Sphinx
Sphinx
• Standalone daemon written in C++
• Highly scalable
o Known installations consist of 50+ boxes and 20+ billion documents
• Extended search for text and non-full-text data
o Optimized for faceted search
o Snippets generation based on language settings
• Very fast
o Keeps attributes in memory
 See Percona benchmarks for details
• Receiving data from PostgreSQL
o Dedicated PostgreSQL datasource type.
http://sphinxsearch.com
Key features - Sphinx
• Scalability & failover
• Extended FT language
• Faceted search support
• GEO-search support
• Integration and pluggable architecture
• Dedicated PostgreSQL source, UDF support
• Morphology & stemming
• Both batch & real-time indexing are available
• Parallel snippets generation
Sphinx - Basic 1 host architecture
[Diagram: Application, sphinx daemon/API (listen = 9312, listen = 9306:mysql41), Indexes and Postgres. Arrow labels: Query, Result, Additional Data, Indexing ASYNC.]
What's new on Sphinx
• 1. added AOT (new morphology library, lemmatizer) support
o Russian only for now; English coming soon; small 10-20% indexing
impact; it's all about search quality (much much better "stemming")
• 2. added JSON support
o limited support (limited subset of JSON) for now; JSON sits in a
column; you're able to do things like WHERE jsoncol.key=123 or
ORDER BY or GROUP BY
• 3. added subselect syntax that reorders result sets, SELECT * FROM
(SELECT ... ORDER BY cond1 LIMIT X) ORDER BY cond2 LIMIT Y
• 4. added bigram indexing, and quicker phrase searching with bigrams
(bigram_index, bigram_freq_words directives)
o improves the worst cases for social mining
• 5. added HA support, ha_strategy, agent_mirror directives
• 6. added a few new geofunctions (POLY2D, GEOPOLY2D, CONTAINS)
• 7. added GROUP_CONCAT()
• 8. added OPTIMIZE INDEX rtindex, rt_merge_iops, rt_merge_maxiosize
directives
• 9. added TRUNCATE RTINDEX statement
Sphinx - Postgres compilation
[root@ip-10-55-83-238 ~]# yum install gcc-c++.noarch
[root@ip-10-55-83-238 sphinx-2.0.6-release]# ./configure --prefix=/opt/sphinx --without-mysql --with-pgsql-includes=$PGSQL_INCLUDE --with-pgsql-libs=$PGSQL_LIBS --with-pgsql
[root@ip-10-55-83-238 sphinx]# /opt/pg/bin/psql -Upostgres -hmaster test < etc/example-pg.sql
* The prebuilt package is compiled with MySQL library dependencies
Data source flow (from DBs)
- connection to the database is established;
- pre-query is executed to perform any necessary initial setup, such as setting the per-connection encoding with MySQL;
- main query is executed and the rows it returns are indexed;
- post-query is executed to perform any necessary cleanup;
- connection to the database is closed;
- indexer does the sorting phase (to be pedantic, index-type specific post-processing);
- connection to the database is established again;
- post-index query is executed to perform any necessary final cleanup;
- connection to the database is closed again.
Sphinx - Daemon
• For speed
• to offload main database
• to make particular queries faster
• Actually, most search-related queries
• For failover
• It happens to the best of us!
• For extended functionality
• Morphology & stemming
• Autocomplete, “do you mean” and “Similar items”
SOLR/Lucene
Solr Features
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces - XML, JSON and HTTP
• Comprehensive HTML Administration Interfaces
• Server statistics exposed over JMX for monitoring
• Linearly scalable, auto index replication, auto failover and recovery
• Near Real-time indexing
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
Solr
• http://lucene.apache.org/solr/features.html
• Solr uses the Lucene library
Thanks!
Contact us!
We are hiring!
emanuel@palominodb.com
Saturday, August 17, 13

Weitere ähnliche Inhalte

Was ist angesagt?

PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PGConf APAC
 
Extending Apache Spark – Beyond Spark Session Extensions
Extending Apache Spark – Beyond Spark Session ExtensionsExtending Apache Spark – Beyond Spark Session Extensions
Extending Apache Spark – Beyond Spark Session ExtensionsDatabricks
 
In-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several timesIn-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several timesAleksander Alekseev
 
Managing terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets bigManaging terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets bigSelena Deckelmann
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Cassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassCassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassDataStax
 
Postgresql Database Administration Basic - Day1
Postgresql  Database Administration Basic  - Day1Postgresql  Database Administration Basic  - Day1
Postgresql Database Administration Basic - Day1PoguttuezhiniVP
 
Developing and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWDeveloping and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWJonathan Katz
 
Materialized views in PostgreSQL
Materialized views in PostgreSQLMaterialized views in PostgreSQL
Materialized views in PostgreSQLAshutosh Bapat
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql databasebigdatagurus_meetup
 
PostgreSQL, your NoSQL database
PostgreSQL, your NoSQL databasePostgreSQL, your NoSQL database
PostgreSQL, your NoSQL databaseReuven Lerner
 
Webscale PostgreSQL - JSONB and Horizontal Scaling Strategies
Webscale PostgreSQL - JSONB and Horizontal Scaling StrategiesWebscale PostgreSQL - JSONB and Horizontal Scaling Strategies
Webscale PostgreSQL - JSONB and Horizontal Scaling StrategiesJonathan Katz
 
Database Architectures and Hypertable
Database Architectures and HypertableDatabase Architectures and Hypertable
Database Architectures and Hypertablehypertable
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiSatoshi Nagayasu
 
Hypertable
HypertableHypertable
Hypertablebetaisao
 
Using Apache Spark to Solve Sessionization Problem in Batch and Streaming
Using Apache Spark to Solve Sessionization Problem in Batch and StreamingUsing Apache Spark to Solve Sessionization Problem in Batch and Streaming
Using Apache Spark to Solve Sessionization Problem in Batch and StreamingDatabricks
 

Was ist angesagt? (20)

PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
 
Extending Apache Spark – Beyond Spark Session Extensions
Extending Apache Spark – Beyond Spark Session ExtensionsExtending Apache Spark – Beyond Spark Session Extensions
Extending Apache Spark – Beyond Spark Session Extensions
 
In-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several timesIn-core compression: how to shrink your database size in several times
In-core compression: how to shrink your database size in several times
 
Full Text Search in PostgreSQL
Full Text Search in PostgreSQLFull Text Search in PostgreSQL
Full Text Search in PostgreSQL
 
Really Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DWReally Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DW
 
Xephon K A Time series database with multiple backends
Xephon K A Time series database with multiple backendsXephon K A Time series database with multiple backends
Xephon K A Time series database with multiple backends
 
Managing terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets bigManaging terabytes: When PostgreSQL gets big
Managing terabytes: When PostgreSQL gets big
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Cassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassCassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break Glass
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Postgresql Database Administration Basic - Day1
Postgresql  Database Administration Basic  - Day1Postgresql  Database Administration Basic  - Day1
Postgresql Database Administration Basic - Day1
 
Developing and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDWDeveloping and Deploying Apps with the Postgres FDW
Developing and Deploying Apps with the Postgres FDW
 
Materialized views in PostgreSQL
Materialized views in PostgreSQLMaterialized views in PostgreSQL
Materialized views in PostgreSQL
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql database
 
PostgreSQL, your NoSQL database
PostgreSQL, your NoSQL databasePostgreSQL, your NoSQL database
PostgreSQL, your NoSQL database
 
Webscale PostgreSQL - JSONB and Horizontal Scaling Strategies
Webscale PostgreSQL - JSONB and Horizontal Scaling StrategiesWebscale PostgreSQL - JSONB and Horizontal Scaling Strategies
Webscale PostgreSQL - JSONB and Horizontal Scaling Strategies
 
Database Architectures and Hypertable
Database Architectures and HypertableDatabase Architectures and Hypertable
Database Architectures and Hypertable
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
 
Hypertable
HypertableHypertable
Hypertable
 
Using Apache Spark to Solve Sessionization Problem in Batch and Streaming
Using Apache Spark to Solve Sessionization Problem in Batch and StreamingUsing Apache Spark to Solve Sessionization Problem in Batch and Streaming
Using Apache Spark to Solve Sessionization Problem in Batch and Streaming
 

Ähnlich wie PostgreSQL FTS Tools

Search Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and SolrSearch Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and SolrKai Chan
 
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Kai Chan
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQLSatoshi Nagayasu
 
How elephants survive in big data environments
How elephants survive in big data environmentsHow elephants survive in big data environments
How elephants survive in big data environmentsMary Prokhorova
 
Rspec API Documentation
Rspec API DocumentationRspec API Documentation
Rspec API DocumentationSmartLogic
 
How Xslate Works
How Xslate WorksHow Xslate Works
How Xslate WorksGoro Fuji
 
Postgres в основе вашего дата-центра, Bruce Momjian (EnterpriseDB)
Postgres в основе вашего дата-центра, Bruce Momjian (EnterpriseDB)Postgres в основе вашего дата-центра, Bruce Momjian (EnterpriseDB)
Postgres в основе вашего дата-центра, Bruce Momjian (EnterpriseDB)Ontico
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2PoguttuezhiniVP
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
What is the best full text search engine for Python?
What is the best full text search engine for Python?What is the best full text search engine for Python?
What is the best full text search engine for Python?Andrii Soldatenko
 
SPARQLing Services
SPARQLing ServicesSPARQLing Services
SPARQLing ServicesLeigh Dodds
 
Introducing Modern Perl
Introducing Modern PerlIntroducing Modern Perl
Introducing Modern PerlDave Cross
 
Elasticsearch in 15 minutes
Elasticsearch in 15 minutesElasticsearch in 15 minutes
Elasticsearch in 15 minutesDavid Pilato
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
 
Pxb For Yapc2008
Pxb For Yapc2008Pxb For Yapc2008
Pxb For Yapc2008maximgrp
 
Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...
Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...
Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...Ontico
 
Perly Parsing with Regexp::Grammars
Perly Parsing with Regexp::GrammarsPerly Parsing with Regexp::Grammars
Perly Parsing with Regexp::GrammarsWorkhorse Computing
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packagesAjay Ohri
 

Ähnlich wie PostgreSQL FTS Tools (20)

Search Engine-Building with Lucene and Solr
PostgreSQL FTS Tools

  • 1. PostgreSQL FTS PGBR 2013 by Emanuel Calvo Saturday, August 17, 13
  • 2. Palomino - Service Offerings
      • Monthly Support:
          o Being renamed to Palomino DBA as a service.
          o Eliminating 10 hour monthly clients.
          o Discounts are based on spend per month (0-80, 81-160, 161+).
          o We will be penalizing excessive paging financially.
          o Quarterly onsite day from Palomino executive, DBA and PM for clients using 80 hours or more per month.
          o Clients using 80-160 hours get 2 New Relic licenses. 160 hours plus get 4.
      • Adding annual support contracts:
          o Consultation as needed.
          o Emergency pages allowed.
          o Small bucket of DBA hours (8, 16 or 24)
      For more information, please go to: Spreadsheet
      “Advanced Technology Partner” for Amazon Web Services
  • 3. About me:
      • Operational DBA at PalominoDB.
      • MySQL, MariaDB and PostgreSQL databases (and others)
      • Community member
      • Check out my LinkedIn Profile at: http://es.linkedin.com/in/ecbcbcb/
  • 4. Credits
      • Thanks to:
          o Andrew Atanasoff
          o Vlad Fedorkov
          o All the PalominoDB people that help out!
  • 5. Agenda
      • What are we looking for?
      • Concepts
      • Native Postgres Support
          o http://www.postgresql.org/docs/9.2/static/textsearch.html
      • External solutions
          o Sphinx: http://sphinxsearch.com/
          o Solr: http://lucene.apache.org/solr/
  • 6. What are we looking for?
      • Finish the talk knowing what FTS is for.
      • Have an idea of which tools are available.
      • Know which of those tools is the best fit.
  • 7. Goals of FTS
      • Add complex searches using synonyms, specific operators or spellings.
          o Improves performance at the cost of some accuracy.
      • Reduce IO and CPU utilization.
          o Text consumes a lot of IO for reads and CPU for operations.
      • FTS can be handled:
          o Externally
              - using tools like Sphinx or Solr
              - Ideal for massive text search with simple queries
          o Internally
              - native FTS support
              - Ideal for complex queries combined with other business rules
      • Order results by relevance
      • Language sensitive
      • Faster than regular expressions or LIKE operators
  • 8. The future of native FTS
      • https://wiki.postgresql.org/wiki/PGCon2013_Unconference_Future_of_Full-Text_Search
      • http://wiki.postgresql.org/images/2/25/Full-text_search_in_PostgreSQL_in_milliseconds-extended-version.pdf
      • 150 Kb patch for 9.3
      • GIN/GIST interface improved
  • 9. WHY FTS?? You may be asking this now.
  • 10. ... and the answer is this reaction when I see a "LIKE '%pattern%'"
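To make the contrast concrete, here is a minimal sketch comparing the two approaches. The `docs` table, its columns and the index name are hypothetical and exist only for this illustration.

```sql
-- Hypothetical table, for illustration only.
CREATE TEMP TABLE docs (id serial PRIMARY KEY, body text);
INSERT INTO docs (body) VALUES
  ('PostgreSQL ships with native full text search'),
  ('Sphinx and Solr are external search engines');

-- Substring match: the leading wildcard keeps a plain B-tree index
-- from helping, so this normally scans every row.
SELECT id FROM docs WHERE body LIKE '%search%';

-- Full text match: the expression can be backed by a GIN index, and
-- stemming also lets 'searching' or 'searches' match the same query.
CREATE INDEX docs_body_fts ON docs USING gin (to_tsvector('english', body));

SELECT id
FROM docs
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'search');
```

The query has to repeat the indexed expression exactly, including the configuration name, for the planner to consider the index.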
  • 11. Concepts
      • Parsers
          o 23 token types (url, email, file, etc)
      • Token
      • Stop word
      • Lexeme
          o array of lexemes + position + weight = tsvector
      • Dictionaries
          o Simple Dictionary
              - The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words.
          o Synonym Dictionary
          o Thesaurus Dictionary
          o Ispell Dictionary
          o Snowball Dictionary
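These concepts can be inspected directly from psql. The sketch below uses only built-in functions and assumes a stock installation with the `default` parser and the `english_stem` dictionary; the sample strings are made up.

```sql
-- Token types known to the default parser (23 on a stock installation).
SELECT count(*) FROM ts_token_type('default');

-- ts_lexize tests one dictionary against one token.
SELECT ts_lexize('english_stem', 'stars');  -- {star}
SELECT ts_lexize('english_stem', 'a');      -- {}  (stop word)

-- ts_parse shows how the parser splits raw text into typed tokens.
SELECT * FROM ts_parse('default', 'mail me at user@example.com');
```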
  • 12. Limitations
      • The length of each lexeme must be less than 2K bytes
      • The length of a tsvector (lexemes + positions) must be less than 1 megabyte
      • The number of lexemes must be less than 2^64
      • Position values in tsvector must be greater than 0 and no more than 16,383
      • No more than 256 positions per lexeme
      • The number of nodes (lexemes + operators) in a tsquery must be less than 32,768
      • These limits are hard to reach! For comparison, the PostgreSQL 8.1 documentation contained 10,441 unique words, a total of 335,420 words, and the most frequent word “postgresql” was mentioned 6,127 times in 655 documents. Another example: the PostgreSQL mailing list archives contained 910,989 unique words with 57,491,343 lexemes in 461,020 messages.
  • 13. Elements
      \dF[+]  [PATTERN]   list text search configurations
      \dFd[+] [PATTERN]   list text search dictionaries
      \dFp[+] [PATTERN]   list text search parsers
      \dFt[+] [PATTERN]   list text search templates

      full_text_search=# \dFd+ *
                                                          List of text search dictionaries
         Schema   |      Name       |      Template       |                   Init options                    |                        Description
      ------------+-----------------+---------------------+---------------------------------------------------+-----------------------------------------------------------
       pg_catalog | danish_stem     | pg_catalog.snowball | language = 'danish', stopwords = 'danish'         | snowball stemmer for danish language
       pg_catalog | dutch_stem      | pg_catalog.snowball | language = 'dutch', stopwords = 'dutch'           | snowball stemmer for dutch language
       pg_catalog | english_stem    | pg_catalog.snowball | language = 'english', stopwords = 'english'       | snowball stemmer for english language
       pg_catalog | finnish_stem    | pg_catalog.snowball | language = 'finnish', stopwords = 'finnish'       | snowball stemmer for finnish language
       pg_catalog | french_stem     | pg_catalog.snowball | language = 'french', stopwords = 'french'         | snowball stemmer for french language
       pg_catalog | german_stem     | pg_catalog.snowball | language = 'german', stopwords = 'german'         | snowball stemmer for german language
       pg_catalog | hungarian_stem  | pg_catalog.snowball | language = 'hungarian', stopwords = 'hungarian'   | snowball stemmer for hungarian language
       pg_catalog | italian_stem    | pg_catalog.snowball | language = 'italian', stopwords = 'italian'       | snowball stemmer for italian language
       pg_catalog | norwegian_stem  | pg_catalog.snowball | language = 'norwegian', stopwords = 'norwegian'   | snowball stemmer for norwegian language
       pg_catalog | portuguese_stem | pg_catalog.snowball | language = 'portuguese', stopwords = 'portuguese' | snowball stemmer for portuguese language
       pg_catalog | romanian_stem   | pg_catalog.snowball | language = 'romanian'                             | snowball stemmer for romanian language
       pg_catalog | russian_stem    | pg_catalog.snowball | language = 'russian', stopwords = 'russian'       | snowball stemmer for russian language
       pg_catalog | simple          | pg_catalog.simple   |                                                   | simple dictionary: just lower case and check for stopword
       pg_catalog | spanish_stem    | pg_catalog.snowball | language = 'spanish', stopwords = 'spanish'       | snowball stemmer for spanish language
       pg_catalog | swedish_stem    | pg_catalog.snowball | language = 'swedish', stopwords = 'swedish'       | snowball stemmer for swedish language
       pg_catalog | turkish_stem    | pg_catalog.snowball | language = 'turkish', stopwords = 'turkish'       | snowball stemmer for turkish language
      (16 rows)
  • 14. Elements
      postgres=# \dF
                     List of text search configurations
         Schema   |    Name    |              Description
      ------------+------------+---------------------------------------
       pg_catalog | danish     | configuration for danish language
       pg_catalog | dutch      | configuration for dutch language
       pg_catalog | english    | configuration for english language
       pg_catalog | finnish    | configuration for finnish language
       pg_catalog | french     | configuration for french language
       pg_catalog | german     | configuration for german language
       pg_catalog | hungarian  | configuration for hungarian language
       pg_catalog | italian    | configuration for italian language
       pg_catalog | norwegian  | configuration for norwegian language
       pg_catalog | portuguese | configuration for portuguese language
       pg_catalog | romanian   | configuration for romanian language
       pg_catalog | russian    | configuration for russian language
       pg_catalog | simple     | simple configuration
       pg_catalog | spanish    | configuration for spanish language
       pg_catalog | swedish    | configuration for swedish language
       pg_catalog | turkish    | configuration for turkish language
      (16 rows)
  • 15. Elements
                                       List of data types
         Schema   |   Name    |                       Description
      ------------+-----------+---------------------------------------------------------
       pg_catalog | gtsvector | GiST index internal text representation for text search
       pg_catalog | tsquery   | query representation for text search
       pg_catalog | tsvector  | text representation for text search
      (3 rows)

      Some operators:
      • @@ (tsvector against tsquery)
      • || concatenate tsvectors (it reorganises lexemes and ranking)
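A short sketch of the two operators, using literal values in the style of the PostgreSQL documentation so that it runs without any table:

```sql
-- @@ : does the document (tsvector) match the query (tsquery)?
SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', 'fat & rat');          -- t

-- || : concatenate two tsvectors; positions in the second vector
-- are shifted past the last position of the first one.
SELECT 'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector;
-- 'a':1 'b':2,5 'c':3 'd':4
```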
  • 16. Small Example
      full_text_search=# create table basic_example (i serial PRIMARY KEY, whole text, fulled tsvector, dictionary regconfig);
      postgres=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON basic_example
                 FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(fulled, "pg_catalog.english", whole);
      CREATE TRIGGER
      postgres=# insert into basic_example(whole,dictionary) values ('This is an example','english'::regconfig);
      INSERT 0 1
      full_text_search=# create index on basic_example(to_tsvector(dictionary,whole));
      CREATE INDEX
      full_text_search=# create index on basic_example using GIST(to_tsvector(dictionary,whole));
      CREATE INDEX
      postgres=# select * from basic_example;
       i |       whole        |   fulled   | dictionary
      ---+--------------------+------------+------------
       5 | This is an example | 'exampl':4 | english
      (1 row)
  • 17. Pre-processing
      • Documents into tokens
          o Find and clean
      • Tokens into lexemes
          o Tokens are normalised according to a language or dictionary
          o Stop words (high-frequency words) are eliminated
      • Storing
          o Array of lexemes (tsvector)
              - word positions still account for stop words, although the stop words themselves are not stored
              - positional information is stored for proximity ranking
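The whole pipeline can be seen in a single call. This example comes from the PostgreSQL documentation: stop words such as "a", "on" and "it" are dropped, yet the surviving lexemes keep the positions they had in the original text.

```sql
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
-- 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
```

Note that "rats" was stemmed to `rat` and that `fat` carries both of its positions.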
  • 18. Highlighting
      • ts_headline
          o it doesn't use the tsvector and needs the entire document, so it can be expensive.
      • Only for certain types of queries or titles
      postgres=# SELECT ts_headline('english','Just a simple example of a highlighted query and similarity.',to_tsquery('query & similarity'),'StartSel = <, StopSel = >');
                                 ts_headline
      ------------------------------------------------------------------
       Just a simple example of a highlighted <query> and <similarity>.
      (1 row)
      Default: StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE, MaxFragments=0, FragmentDelimiter=" ... "
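Because ts_headline works on the original text, one common way to contain its cost is to rank and limit first and highlight only the rows that will be shown. The sketch below follows the pattern used in the PostgreSQL documentation; the `articles` table and its contents are hypothetical.

```sql
-- Hypothetical table with a precomputed tsvector column.
CREATE TEMP TABLE articles (id serial PRIMARY KEY, body text, tsv tsvector);
INSERT INTO articles (body) VALUES
  ('Just a simple example of a highlighted query and similarity.'),
  ('Another document that does not match.');
UPDATE articles SET tsv = to_tsvector('english', body);

-- The inner query matches, ranks and limits using the tsvector only;
-- ts_headline then runs on at most 10 documents.
SELECT id, ts_headline('english', body, q) AS headline, rank
FROM (SELECT id, body, q, ts_rank_cd(tsv, q) AS rank
      FROM articles, to_tsquery('english', 'query & similarity') AS q
      WHERE tsv @@ q
      ORDER BY rank DESC
      LIMIT 10) AS top_hits;
```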
  • 19. Ranking
      • Weights: (A B C D)
      • Ranking functions:
          o ts_rank
          o ts_rank_cd
      • Ranking is expensive because it has to process the tsvector of every matching document.
      SELECT to_tsquery('english', 'Fat | Rats:AB');
          to_tsquery
      ------------------
       'fat' | 'rat':AB
      Also, * can be attached to a lexeme to specify prefix matching:
      SELECT to_tsquery('supern:*A & star:A*B');
              to_tsquery
      --------------------------
       'supern':*A & 'star':*AB
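A minimal sketch of weighted ranking, assuming a hypothetical `posts` table in which a hit in the title should count more than a hit in the body. The weights array passed to ts_rank is ordered {D, C, B, A}; the values used here are the documented defaults.

```sql
-- Hypothetical data, for illustration only.
CREATE TEMP TABLE posts (id serial PRIMARY KEY, title text, body text);
INSERT INTO posts (title, body) VALUES
  ('Rats in the city', 'A report about urban wildlife.'),
  ('Urban wildlife',   'Foxes, pigeons and the occasional rat.');

SELECT id, title,
       ts_rank('{0.1, 0.2, 0.4, 1.0}', doc, q) AS rank
FROM (SELECT id, title,
             setweight(to_tsvector('english', title), 'A') ||
             setweight(to_tsvector('english', body),  'D') AS doc
      FROM posts) AS p,
     to_tsquery('english', 'rat') AS q
WHERE doc @@ q
ORDER BY rank DESC;
```

With these weights the row that matches in its title is expected to sort first. On a real table the weighted tsvector would normally be stored in a column rather than computed per query.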
  • 20. Manipulating tsvectors and tsquery
      • Manipulating tsvectors
          o setweight(vector tsvector, weight "char") returns tsvector
          o length(tsvector): number of lexemes
          o strip(tsvector): returns the tsvector without position or weight information
      • Manipulating queries
          o If the query comes from dynamic input, check it with numnode(tsquery) first; this avoids unnecessary searches when the input contains only stop words
              - numnode(plainto_tsquery('a the is'))
          o Cleaning up queries with querytree is also useful
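The functions listed above can be tried on literal values. The results shown in the comments are the ones given in the PostgreSQL documentation for the same inputs.

```sql
SELECT setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A');
-- 'cat':3A 'fat':2A,4A 'rat':5A

SELECT length('fat:2,4 cat:3 rat:5A'::tsvector);   -- 3
SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);    -- 'cat' 'fat' 'rat'

-- numnode counts lexemes plus operators; 0 means nothing is left to search.
SELECT numnode('(fat & rat) | cat'::tsquery);      -- 5
SELECT numnode(plainto_tsquery('english', 'a the is'));
-- NOTICE about a query containing only stop words, result 0
```

An application can skip the search entirely when numnode returns 0.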
  • 21. Example
      postgres=# select * from ts_debug('english','The doctor saids I''m sick.');
         alias   |   description   | token  |  dictionaries  |  dictionary  | lexemes
      -----------+-----------------+--------+----------------+--------------+----------
       asciiword | Word, all ASCII | The    | {english_stem} | english_stem | {}
       blank     | Space symbols   |        | {}             |              |
       asciiword | Word, all ASCII | doctor | {english_stem} | english_stem | {doctor}
       blank     | Space symbols   |        | {}             |              |
       asciiword | Word, all ASCII | saids  | {english_stem} | english_stem | {said}
       blank     | Space symbols   |        | {}             |              |
       asciiword | Word, all ASCII | I      | {english_stem} | english_stem | {}
       blank     | Space symbols   | '      | {}             |              |
       asciiword | Word, all ASCII | m      | {english_stem} | english_stem | {m}
       blank     | Space symbols   |        | {}             |              |
       asciiword | Word, all ASCII | sick   | {english_stem} | english_stem | {sick}
       blank     | Space symbols   | .      | {}             |              |
      (12 rows)

      postgres=# select numnode(plainto_tsquery('The doctor saids I''m sick.')),
                        plainto_tsquery('The doctor saids I''m sick.'),
                        to_tsvector('english','The doctor saids I''m sick.'),
                        ts_lexize('english_stem','The doctor saids I''m sick.');
       numnode |         plainto_tsquery          |            to_tsvector             |           ts_lexize
      ---------+----------------------------------+------------------------------------+--------------------------------
             7 | 'doctor' & 'said' & 'm' & 'sick' | 'doctor':2 'm':5 'said':3 'sick':6 | {"the doctor saids i'm sick."}
      (1 row)
  • 22. Manipulating tsquery
      postgres=# SELECT querytree(to_tsquery('!defined'));
       querytree
      -----------
       T
      (1 row)

      postgres=# SELECT querytree(to_tsquery('cat & food | (dog & run & food)'));
                      querytree
      -----------------------------------------
       'cat' & 'food' | 'dog' & 'run' & 'food'
      (1 row)

      postgres=# SELECT querytree(to_tsquery('the '));
      NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
       querytree
      -----------

      (1 row)
  • 23. Automating updates on tsvector
      • PostgreSQL provides standard functions for this:
          o tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ])
          o tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... ])
      CREATE TABLE messages (
          title text,
          body text,
          tsv tsvector
      );
      CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON messages
          FOR EACH ROW EXECUTE PROCEDURE
          tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
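To see the trigger at work, the following sketch recreates the slide's definitions on a temporary table (so it can be run in a scratch session) and then inserts and searches one row. The sample row is the one used in the PostgreSQL documentation.

```sql
CREATE TEMP TABLE messages (title text, body text, tsv tsvector);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON messages
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);

-- tsv is not supplied; the trigger fills it in from title and body.
INSERT INTO messages (title, body)
VALUES ('title here', 'the body text is here');

SELECT tsv FROM messages;
-- 'bodi':4 'text':5 'titl':1

SELECT title, body
FROM messages
WHERE tsv @@ to_tsquery('english', 'title & body');
```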
  • 24. Automating updates on tsvector (2)
      If you want to keep a custom weight:
      CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
      begin
          new.tsv :=
              setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
              setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
          return new;
      end
      $$ LANGUAGE plpgsql;
      CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON messages
          FOR EACH ROW EXECUTE PROCEDURE messages_trigger();
  • 25. Tips and considerations
      • Store the text externally, index in the database
          o requires superuser
      • Store the whole document in the database, index in Sphinx/Solr
      • Don't index everything
          o Solr/Sphinx are not databases; index only what you want to search. Smaller indexes are faster and easier to maintain.
      • ts_stat
          o can help you check your FTS configuration
      • You can parse URLs, email addresses and more with the ts_debug function for non-intensive operations
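As a sketch of the statistics tip: the built-in function is called `ts_stat`, and it takes the text of a query that returns a single tsvector column. The `notes` table below is hypothetical.

```sql
-- Hypothetical table, for illustration only.
CREATE TEMP TABLE notes (body text, tsv tsvector);
INSERT INTO notes (body) VALUES
  ('the quick brown fox'),
  ('the lazy dog and the quick cat');
UPDATE notes SET tsv = to_tsvector('english', body);

-- Most frequent lexemes: ndoc = documents containing the word,
-- nentry = total occurrences. Very frequent words are candidates
-- for the stop-word list.
SELECT word, ndoc, nentry
FROM ts_stat('SELECT tsv FROM notes')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
```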
  • 26. Tips and considerations
      • You can index by language
      CREATE INDEX pgweb_idx_en ON pgweb USING gin(to_tsvector('english', body)) WHERE config_language = 'english';
      CREATE INDEX pgweb_idx_fr ON pgweb USING gin(to_tsvector('french', body)) WHERE config_language = 'french';
      CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_language, body));
      CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
      • Table partitioning by language is also a good practice
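For a per-language partial index to be used, the query has to repeat both the indexed expression and the index predicate. A minimal sketch, assuming a hypothetical `pages` table whose `lang` column is plain text:

```sql
-- Hypothetical table, for illustration only.
CREATE TEMP TABLE pages (id serial PRIMARY KEY, lang text, body text);

CREATE INDEX pages_idx_en ON pages
    USING gin (to_tsvector('english', body))
    WHERE lang = 'english';

-- Same expression and same predicate as the index definition.
SELECT id
FROM pages
WHERE lang = 'english'
  AND to_tsvector('english', body) @@ to_tsquery('english', 'friend');
```

Running the SELECT under EXPLAIN on a populated table is a quick way to confirm whether the index is actually chosen.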
  • 27. Features on 9.2
      • Move tsvector most-common-element statistics to new pg_stats columns (Alexander Korotkov)
      • Consult most_common_elems and most_common_elem_freqs for the data formerly available in most_common_vals and most_common_freqs for a tsvector column.
      most_common_elems      | {exampl}
      most_common_elem_freqs | {1,1,1}
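The statistics shown above can be reproduced on 9.2 or later with a query like the following. The `stats_demo` table is hypothetical and stands in for any table with a tsvector column.

```sql
-- Hypothetical table, for illustration only.
CREATE TEMP TABLE stats_demo (tsv tsvector);
INSERT INTO stats_demo VALUES (to_tsvector('english', 'This is an example'));

-- Statistics are only collected by ANALYZE (or autovacuum).
ANALYZE stats_demo;

SELECT most_common_elems, most_common_elem_freqs
FROM pg_stats
WHERE tablename = 'stats_demo' AND attname = 'tsv';
```

The frequency array is longer than the element array because the per-element frequencies are followed by their minimum and maximum, which is why a single element shows up with three values.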
  • 30. Sphinx
      • Standalone daemon written in C++
      • Highly scalable
          o A known installation consists of 50+ boxes and 20+ billion documents
      • Extended search for text and non-full-text data
          o Optimized for faceted search
          o Snippet generation based on language settings
      • Very fast
          o Keeps attributes in memory
              - See Percona benchmarks for details
      • Receiving data from PostgreSQL
          o Dedicated PostgreSQL datasource type.
      http://sphinxsearch.com
  • 31. Key features - Sphinx
      • Scalability & failover
      • Extended FT language
      • Faceted search support
      • GEO-search support
      • Integration and pluggable architecture
      • Dedicated PostgreSQL source, UDF support
      • Morphology & stemming
      • Both batch & real-time indexing are available
      • Parallel snippet generation
  • 32. Sphinx - Basic 1 host architecture
      [Diagram: Application, Postgres and the sphinx daemon/API (listen = 9312, listen = 9306:mysql41) with its Indexes. Arrow labels: Query, Result, Additional Data, Indexing (ASYNC).]
  • 33. What's new on Sphinx
      1. added AOT (new morphology library, lemmatizer) support
          o Russian only for now; English coming soon; small 10-20% indexing impact; it's all about search quality (much much better "stemming")
      2. added JSON support
          o limited support (limited subset of JSON) for now; JSON sits in a column; you're able to do things like WHERE jsoncol.key=123 or ORDER BY or GROUP BY
      3. added subselect syntax that reorders result sets, SELECT * FROM (SELECT ... ORDER BY cond1 LIMIT X) ORDER BY cond2 LIMIT Y
      4. added bigram indexing, and quicker phrase searching with bigrams (bigram_index, bigram_freq_words directives)
          o improves the worst cases for social mining
      5. added HA support, ha_strategy, agent_mirror directives
      6. added a few new geofunctions (POLY2D, GEOPOLY2D, CONTAINS)
      7. added GROUP_CONCAT()
      8. added OPTIMIZE INDEX rtindex, rt_merge_iops, rt_merge_maxiosize directives
      9. added TRUNCATE RTINDEX statement
  • 34. Sphinx - Postgres compilation
      [root@ip-10-55-83-238 ~]# yum install gcc-c++.noarch
      [root@ip-10-55-83-238 sphinx-2.0.6-release]# ./configure --prefix=/opt/sphinx --without-mysql --with-pgsql-includes=$PGSQL_INCLUDE --with-pgsql-libs=$PGSQL_LIBS --with-pgsql
      [root@ip-10-55-83-238 sphinx]# /opt/pg/bin/psql -Upostgres -hmaster test < etc/example-pg.sql
      * Package is compiled with mysql libraries dependencies
  • 35. Data source flow (from DBs)
      - connection to the database is established;
      - pre-query is executed to perform any necessary initial setup, such as setting per-connection encoding with MySQL;
      - main query is executed and the rows it returns are indexed;
      - post-query is executed to perform any necessary cleanup;
      - connection to the database is closed;
      - indexer does the sorting phase (to be pedantic, index-type specific post-processing);
      - connection to the database is established again;
      - post-index query is executed to perform any necessary final cleanup;
      - connection to the database is closed again.
  • 36. Sphinx - Daemon
      • For speed
          o to offload the main database
          o to make particular queries faster
          o in practice, most search-related queries
      • For failover
          o It happens to the best of us!
      • For extended functionality
          o Morphology & stemming
          o Autocomplete, “do you mean” and “Similar items”
  • 38. Solr Features
      • Advanced Full-Text Search Capabilities
      • Optimized for High Volume Web Traffic
      • Standards Based Open Interfaces - XML, JSON and HTTP
      • Comprehensive HTML Administration Interfaces
      • Server statistics exposed over JMX for monitoring
      • Linearly scalable, auto index replication, auto failover and recovery
      • Near Real-time indexing
      • Flexible and Adaptable with XML configuration
      • Extensible Plugin Architecture
  • 39. Solr
      • http://lucene.apache.org/solr/features.html
      • Solr uses the Lucene library
  • 40. Thanks! Contact us! We are hiring! emanuel@palominodb.com