What is the best
full text search engine
for Python?
Andrii Soldatenko
23-24 April 2016
@a_soldatenko
Agenda:
• Who am I?
• What is full text search?
• PostgreSQL FTS / Elastic / Whoosh
• Pros and Cons
• What’s next?
Andrii Soldatenko
• Backend Python Developer at
• CTO at Persollo.com
• Speaker at many PyCons and Python meetups
• Blogger at https://asoldatenko.com
Preface
Text Search
➜ cpython time ack OrderedDict
ack OrderedDict 2.53s user 0.22s system 94% cpu 2.915 total
➜ cpython time pt OrderedDict
pt OrderedDict 0.14s user 0.12s system 406% cpu 0.064 total
➜ cpython time pss OrderedDict
pss OrderedDict 1.08s user 0.14s system 88% cpu 1.370 total
➜ cpython time grep -r -i 'OrderedDict' .
grep -r -i 'OrderedDict' 2.70s user 0.13s system 94% cpu 2.998 total
Full text search
Search index
Simple sentences
1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer
Inverted index
Inverted index: normalization
Before normalization:
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------
After normalization (lowercased, plurals stemmed, "jumped" and "leap" indexed as "jump"):
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |
-------------------------
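A minimal sketch of how such an index could be built for the two sample sentences, with and without normalization. The `NORMAL_FORMS` lookup table is a hypothetical stand-in for a real stemmer and synonym filter; it only covers the words that appear on the slide.

```python
from collections import defaultdict

DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Hypothetical lookup table standing in for a real stemmer and synonym filter.
NORMAL_FORMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump", "leap": "jump"}


def normalize(token):
    """Lowercase the token, then map it to its normal form if one is known."""
    term = token.lower()
    return NORMAL_FORMS.get(term, term)


def build_index(docs, analyze=lambda token: token):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[analyze(token)].add(doc_id)
    return dict(index)


raw_index = build_index(DOCS)
normalized_index = build_index(DOCS, analyze=normalize)

print(sorted(raw_index["quick"]))         # exact-case matches only
print(sorted(normalized_index["quick"]))  # matches after lowercasing
```

The same normalization has to be applied to the query terms at search time, otherwise a query for "Quick" would still miss the lowercased index entry.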
Search Engines
PostgreSQL
Full Text Search
Supported since version 8.3
PostgreSQL
Full Text Search
SELECT to_tsvector('text') @@ to_tsquery('query');
"Simple is better than complex." (import this)
PostgreSQL FTS without an index
SELECT 'python conference ukraine 2016'::tsvector @@ 'python & ukraine'::tsquery;
 ?column?
----------
 t
(1 row)
PostgreSQL FTS with an index
CREATE INDEX name ON table USING GIN (column);
CREATE INDEX name ON table USING GIST (column);
PostgreSQL FTS: Ranking Search Results
ts_rank() -> float4: ranks documents by the frequency of their matching lexemes
ts_rank_cd() -> float4: computes the cover density ranking for the given document vector and query
PostgreSQL FTS: Highlighting Results
SELECT ts_headline('english',
    'python conference ukraine 2016',
    to_tsquery('python & 2016'));
ts_headline
----------------------------------------------
<b>python</b> conference ukraine <b>2016</b>
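To make the idea of highlighting concrete, here is a toy Python sketch that wraps matching words in tags. It is not how `ts_headline` is implemented (no stemming, no fragment selection); the function name `headline` and its parameters are assumptions made for illustration.

```python
import re


def headline(text, query_terms, start="<b>", stop="</b>"):
    """Wrap every whole-word occurrence of a query term in start/stop tags."""
    wanted = {term.lower() for term in query_terms}

    def mark(match):
        word = match.group(0)
        return f"{start}{word}{stop}" if word.lower() in wanted else word

    # \w+ visits each word once; non-word characters are left untouched.
    return re.sub(r"\w+", mark, text)


print(headline("python conference ukraine 2016", ["python", "2016"]))
```

For the sample document and query from the slide, this prints the same marked-up string as the `ts_headline` output.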
Stop Words
postgresql/9.5.2/share/postgresql/tsearch_data/english.stop
PostgreSQL FTS: Stop Words
SELECT to_tsvector('in the list of stop words');
to_tsvector
----------------------------
'list':3 'stop':5 'word':6
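The output above shows two things: stop words are dropped, and the remaining lexemes keep their original positions. A rough Python sketch of that behaviour, assuming a three-word stop list and a deliberately crude stemmer (both hypothetical, not PostgreSQL's actual dictionaries):

```python
# Tiny illustrative subset, not the contents of the real english.stop file.
STOP_WORDS = {"in", "the", "of"}


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def to_vector(text):
    """Return {lexeme: [positions]} with stop words removed."""
    vector = {}
    for position, token in enumerate(text.lower().split(), start=1):
        if token in STOP_WORDS:
            continue  # dropped, but the word still consumed a position
        vector.setdefault(crude_stem(token), []).append(position)
    return vector


print(to_vector("in the list of stop words"))
```

Because positions are counted before filtering, "list" stays at position 3 even though the two words before it were discarded.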
PG FTS and Python
• Django 1.10: django.contrib.postgres.search (36 hours ago)
• djorm-ext-pgfulltext
• sqlalchemy
PostgreSQL FTS integration with Django ORM
https://github.com/linuxlewis/djorm-ext-pgfulltext
from djorm_pgfulltext.models import SearchManager
from djorm_pgfulltext.fields import VectorField
from django.db import models

class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()
    search_index = VectorField()

    objects = SearchManager(
        fields=('name', 'description'),
        config='pg_catalog.english',  # this is default
        search_field='search_index',  # this is default
        auto_update_search_field=True
    )
To search, use the manager's search method
https://github.com/linuxlewis/djorm-ext-pgfulltext
>>> Page.objects.search("documentation & about")
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True)
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
Django 1.10
>>> Entry.objects.filter(body_text__search='recipe')
[<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>]
>>> from django.contrib.postgres.search import SearchVector
>>> Entry.objects.annotate(
...     search=SearchVector('blog__tagline', 'body_text'),
... ).filter(search='cheese')
[
    <Entry: Cheese on Toast recipes>,
    <Entry: Pizza Recipes>,
    <Entry: Dairy farming in Argentina>,
]
https://github.com/django/django/commit/2d877da
Pros and Cons
Pros:
• Quick implementation
• No extra dependency if you already run PostgreSQL
Cons:
• Indexes must be managed manually
• Depends on PostgreSQL
• No analytics data
• No query DSL, only operators such as `&` and `|`
• Stop words are difficult to manage
ElasticSearch
Who uses ElasticSearch?
ElasticSearch: Quick Intro
Relational DB | Databases | Tables | Rows      | Columns
ElasticSearch | Indices   | Types  | Documents | Fields
ElasticSearch: Locks
• Pessimistic concurrency control
• Optimistic concurrency control
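Optimistic concurrency control assumes conflicts are rare: writers do not lock, they state which version they read and the write is rejected if the document has changed since. The sketch below illustrates the idea with an in-memory store; `VersionedStore` and `VersionConflict` are hypothetical names, not part of any Elasticsearch client.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the document."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, body, expected_version=None):
        """Store body; if expected_version is given, it must match the current one."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected v{expected_version}, found v{current_version}"
            )
        self._docs[doc_id] = (current_version + 1, body)
        return current_version + 1


store = VersionedStore()
store.put("talk-1", {"user": "andrii"})
version, body = store.get("talk-1")

# First writer succeeds because nobody changed the document in between.
store.put("talk-1", {"user": "andrii", "room": "A"}, expected_version=version)

# Second writer still holds the old version, so the write is rejected.
try:
    store.put("talk-1", {"user": "someone else"}, expected_version=version)
except VersionConflict as exc:
    print("conflict:", exc)
```

The rejected writer is expected to re-read the document and retry, which is cheaper than holding locks when conflicts are uncommon.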
ElasticSearch and
Python
• elasticsearch-py
• elasticsearch-dsl-py by Honza Kral
• elasticsearch-py-async by Honza Kral
ElasticSearch: FTS
$ curl -XGET 'http://localhost:9200/pyconua/talk/_search' -d '
{
    "query": {
        "match": {
            "user": "Andrii"
        }
    }
}'
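The same request can be prepared from Python with only the standard library. This is a sketch, not the elasticsearch-py API: `match_query` and `build_search_request` are hypothetical helpers, and the request is built but not sent, so no running node is needed to try it.

```python
import json
import urllib.request


def match_query(field, text):
    """Build the body of a `match` query for one field."""
    return {"query": {"match": {field: text}}}


def build_search_request(base_url, index, doc_type, body):
    """Prepare (but do not send) the HTTP request for the _search endpoint."""
    url = f"{base_url.rstrip('/')}/{index}/{doc_type}/_search"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",
    )


body = match_query("user", "Andrii")
request = build_search_request("http://localhost:9200", "pyconua", "talk", body)

print(request.get_method(), request.full_url)
print(json.dumps(body, indent=4))
# Against a live node the request could be sent with urllib.request.urlopen(request).
```

The index/type URL layout mirrors the curl command on the slide; other Elasticsearch versions may expect a different path.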
ES: Create Index
$ curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 2
        }
    }
}'
ES: Add JSON to Index
$ curl -XPUT 'http://localhost:9200/pyconua/talk/1' -d '{
    "user" : "andrii",
    "description" : "Full text search"
}'
ES: Stopwords
$ curl -XPUT 'http://localhost:9200/pyconua' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords_path": "stopwords/english.txt"
        }
      }
    }
  }
}'
ES: Highlight
$ curl -XGET 'http://localhost:9200/pyconua/talk/_search' -d '{
    "query" : {...},
    "highlight" : {
        "pre_tags" : ["<tag1>"],
        "post_tags" : ["</tag1>"],
        "fields" : {
            "_all" : {}
        }
    }
}'
ES: Relevance
$ curl -XGET 'http://localhost:9200/_search?explain' -d '
{
    "query" : { "match" : { "user" : "andrii" }}
}'
"_explanation": {
  "description": "weight(tweet:honeymoon in 0)
                  [PerFieldSimilarity], result of:",
  "value": 0.076713204,
  "details": [...]
}
Whoosh
• Pure-Python
• Whoosh was created by Matt Chaput.
• Pluggable scoring algorithm (including BM25F)
• More info: video from PyCon US 2013
Whoosh: Stop words
import os.path
import textwrap

# Print one frozenset literal per stop-word file found in ./stopwords
names = os.listdir("stopwords")
for name in names:
    with open(os.path.join("stopwords", name)) as f:
        wordls = [line.strip() for line in f]
    words = " ".join(wordls)
    print('"%s": frozenset(u"""' % name)
    print(textwrap.fill(words, 72))
    print('""".split())')
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/
Whoosh: Highlight
results = pycon.search(myquery)
for hit in results:
    print(hit["title"])
    # Assume "content" field is stored
    print(hit.highlights("content"))
Whoosh: Ranking search results
• Pluggable scoring algorithm
• including BM25F
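For intuition about what a BM25-style scorer computes, here is a small single-field BM25 sketch in plain Python. It is not Whoosh's BM25F implementation (which also weights multiple fields); the function name and the `k1` and `b` values are assumptions picked for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with plain single-field BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates via k1; b controls length normalization.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "The quick brown fox jumped over the lazy dog",
    "Quick brown foxes leap over lazy dogs in summer",
]
print(bm25_scores("fox", docs))
```

Without stemming, only the first sentence matches "fox"; the second contains "foxes" and scores zero, which is the same normalization issue shown in the inverted index slides.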
Haystack
Adding search functionality to Simple Model
$ cat myapp/models.py
from django.db import models
from django.contrib.auth.models import User

class Page(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=200)
    description = models.TextField()

    def __unicode__(self):
        return self.name
Haystack: Installation
$ pip install django-haystack
$ cat settings.py
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',
    # Added.
    'haystack',
    # Then your usual apps...
    'blog',
]
Haystack: Installation
$ pip install elasticsearch
$ cat settings.py
...
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}
...
Haystack: Creating SearchIndexes
$ cat myapp/search_indexes.py
import datetime
from haystack import indexes
# This index targets a Note model with `user` and `pub_date` fields
# (as in the Haystack tutorial), not the Page model shown earlier.
from myapp.models import Note

class PageIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.filter(pub_date__lte=datetime.datetime.now())
Haystack: SearchQuerySet API
from haystack.query import SearchQuerySet
from haystack.inputs import Raw

all_results = SearchQuerySet().all()
hello_results = SearchQuerySet().filter(content='hello')
unfriendly_results = (
    SearchQuerySet()
    .exclude(content='hello')
    .filter(content='world')
)
# To send unescaped data:
sqs = SearchQuerySet().filter(title=Raw(trusted_query))
Keeping data in sync
# Update everything.
./manage.py update_index --settings=settings.prod
# Update everything with lots of information about what's going on.
./manage.py update_index --settings=settings.prod --verbosity=2
# Update everything, cleaning up after deleted models.
./manage.py update_index --remove --settings=settings.prod
# Update everything changed in the last 2 hours.
./manage.py update_index --age=2 --settings=settings.prod
# Update everything between Dec. 1, 2011 & Dec 31, 2011
./manage.py update_index --start='2011-12-01T00:00:00' --end='2011-12-31T23:59:59' --settings=settings.prod
Signals
class RealtimeSignalProcessor(BaseSignalProcessor):
    """
    Allows for observing when saves/deletes fire & automatically updates the
    search engine appropriately.
    """
    def setup(self):
        # Naive (listen to all model saves).
        models.signals.post_save.connect(self.handle_save)
        models.signals.post_delete.connect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then hooking up signals only for those.

    def teardown(self):
        # Naive (listen to all model saves).
        models.signals.post_save.disconnect(self.handle_save)
        models.signals.post_delete.disconnect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then disconnecting signals only for those.
Haystack: Pros and Cons
Pros:
• Easy to set up
• Looks like the Django ORM, but for searches
• Search engine independent
• Supports 4 engines (Elastic, Solr, Xapian, Whoosh)
Cons:
• Poor SearchQuerySet API
• Stop words are difficult to manage
• Loses performance because of the extra layer
• Model-based
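Results
Engine        | Python clients                                                 | Python 3 | Django support
----------------------------------------------------------------------------------------------------------------------------
ElasticSearch | elasticsearch-py, elasticsearch-dsl-py, elasticsearch-py-async | YES      | haystack + elasticstack
PostgreSQL    | psycopg2                                                       | YES      | djorm-ext-pgfulltext, django.contrib.postgres
Whoosh        | Whoosh                                                         | YES      | support using haystack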
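Results
Engine        | Indexes      | Without indexes
----------------------------------------------
ElasticSearch | PUT /index/  | No support
PostgreSQL    | GIN/GIST     | to_tsvector()
Whoosh        | index folder | No support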
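Results
Engine        | Ranking / relevance | Configure stopwords | Highlight search results
------------------------------------------------------------------------------------
ElasticSearch | TF/IDF              | YES                 | YES
PostgreSQL    | cd_rank             | YES                 | YES
Whoosh        | Okapi BM25          | YES                 | YES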
Results
Engine        | Synonyms   | Scale
-----------------------------------------
ElasticSearch | YES        | YES
PostgreSQL    | NO SUPPORT | I’m not sure
Whoosh        | NO SUPPORT | NO
Final Thoughts
Questions?
Thank You
andrii.soldatenko@toptal.com
@a_soldatenko
https://asoldatenko.com
We are hiring
https://www.toptal.com/#connect-fantastic-computer-engineers

PyCon UA 2016
