Building a Search Engine with Scrapy + Sphinx

银平 pkufranky@gmail.com
2010-06-07

Outline

• Overview
• Scrapy – a Python crawling framework
• Sphinx – a C++ full-text search engine
• Demo – a novel search engine built with Scrapy + Sphinx

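Overview - Types of search engines and crawlers

• Search engines
o General-purpose search engines
o Vertical search engines
o Resource-oriented vertical search engines
• Crawlers
o General-purpose crawlers
o Focused crawlers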

Overview - Search engines

• Tokenization (word segmentation)
• Inverted index
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
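
To make the two building blocks above concrete, here is a minimal sketch of tokenization plus an inverted index in plain Python. The tokenizer (lowercase, split on non-alphanumeric characters) and the sample documents are assumptions for illustration; Chinese text would need a real word segmenter, and Sphinx's own index format is far more elaborate.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [set(index.get(term, ())) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample documents.
DOCS = {
    1: "Scrapy is a crawling framework",
    2: "Sphinx is a full-text search engine",
    3: "Scrapy feeds Sphinx",
}
INDEX = build_inverted_index(DOCS)
```

Calling search_all(INDEX, "scrapy sphinx") intersects the posting lists of both terms, which is the core operation an inverted index makes cheap.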

Scrapy – a Python crawling framework

• Architecture
• Built-in middlewares
• Extensions
• Extracting data from web pages

Architecture
• Components
o Scrapy Engine
o Scheduler
o Downloader
o Spider
o Item Pipeline
o Middlewares
• Event-driven networking: Twisted

Built-in middlewares

• Downloader middlewares
o DefaultHeadersMiddleware
o HttpAuthMiddleware
o HttpCacheMiddleware
o RedirectMiddleware
o RetryMiddleware
• Spider middlewares
o DepthMiddleware
o RefererMiddleware
• Scheduler middlewares
o DuplicatesFilterMiddleware
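
As a rough illustration of what a downloader middleware looks like, the sketch below fills in a default header on outgoing requests, a simplified stand-in for the idea behind DefaultHeadersMiddleware. It assumes the process_request(request, spider) hook as described in current Scrapy documentation, which may differ from the 2010 release; the class name, the User-Agent string, and the FakeRequest stub are hypothetical and exist only so the sketch runs without Scrapy installed.

```python
class DefaultUserAgentMiddleware:
    """Hypothetical downloader middleware that sets a default User-Agent."""

    def __init__(self, user_agent="sedemo-spider (+http://example.com)"):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # setdefault keeps any header the spider already set explicitly.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None lets the request continue through the remaining
        # middlewares and on to the downloader.
        return None


class FakeRequest:
    """Stand-in for a Scrapy request; only the headers mapping is modeled."""

    def __init__(self, headers=None):
        self.headers = dict(headers or {})
```

In a real project such a class would be enabled through the DOWNLOADER_MIDDLEWARES setting rather than called by hand.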

Extensions

• Characteristics
o Ordinary classes loaded when Scrapy starts
o Listen for signals (engine_started, item_scraped, item_dropped)
• Built-in extensions
o CoreStats
o WebConsole
o …

Extracting data from web pages

• CrawlSpider: Rule/Matcher/callback
• Extraction with XPath
• Scrapy shell
• Parsley: a selector language, a superset of XPath and CSS3 (has memory leaks)
li.main>a/@href
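
The selector li.main>a/@href reads as "the href of every a that is a direct child of an li with class main". A minimal sketch of the same extraction follows, using the XPath subset in Python's standard xml.etree.ElementTree rather than Scrapy's selectors or Parsley, so it runs without extra dependencies. The sample markup is hypothetical and must be well-formed XML for this parser; also note that [@class='main'] matches the attribute value exactly, which is stricter than a CSS class match.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html><body><ul>
  <li class="main"><a href="/book/1">Book one</a></li>
  <li class="side"><a href="/ads">Ad</a></li>
  <li class="main"><a href="/book/2">Book two</a><span><a href="/nested">x</a></span></li>
</ul></body></html>
"""


def main_links(markup):
    """Return href values of <a> elements that are direct children of li.main."""
    root = ET.fromstring(markup.strip())
    # ".//li[@class='main']/a" selects direct <a> children only, so the
    # link nested inside <span> is not returned.
    return [a.get("href") for a in root.findall(".//li[@class='main']/a")]
```

Inside a Scrapy spider the equivalent XPath would be applied to the response, and the Scrapy shell is a convenient place to try such expressions interactively.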

Sphinx – a C++ full-text search engine

• Sphinx features
• Sphinx components
• Indexing
• Searching
• SphinxSE: a MySQL storage engine

Sphinx features
• high indexing speed (up to 10 MB/sec on modern CPUs);
• high search speed (avg query is under 0.1 sec on 2-4 GB text collections);
• high scalability (up to 100 GB of text, up to 100 M documents on a single
CPU);
• provides good relevance ranking through combination of phrase proximity
ranking and statistical (BM25) ranking;
• provides distributed searching capabilities;
• provides document excerpts generation;
• provides searching from within MySQL through pluggable storage engine;
• supports boolean, phrase, and word proximity queries;
• supports multiple full-text fields per document (up to 32 by default);
• supports multiple additional attributes per document (i.e. groups, timestamps,
etc);
• supports stopwords;
• supports both single-byte encodings and UTF-8;
• supports English stemming, Russian stemming, and Soundex for morphology;
• supports MySQL natively (MyISAM and InnoDB tables are both supported);
• supports PostgreSQL natively.

Sphinx components

• indexer (binary)
• searchd (binary)
• search (binary)
• sphinxapi (API libraries for PHP, Python, Perl, Ruby)
• spelldump
• indextool

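Indexing

• Data sources: databases, XML, and so on.
o Each table row is treated as one document
o The configuration specifies which columns are indexed
• Attributes: some columns can be declared as document attributes; they are not
full-text indexed, but can be used for filtering and sorting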

Indexing (2)

Index configuration fragment

sql_query = SELECT id, title, content, \
    author_id, forum_id, post_date FROM my_forum_posts
sql_attr_uint = author_id
sql_attr_uint = forum_id
sql_attr_timestamp = post_date

Filtering and sorting example

// only search posts by author whose ID is 123
$cl->SetFilter ( "author_id", array ( 123 ) );

// only search posts in sub-forums 1, 3 and 7
$cl->SetFilter ( "forum_id", array ( 1,3,7 ) );

// sort found posts by posting date in descending order
$cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
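
To show what the filter and sort calls above mean for the result set, here is a small pure-Python sketch of the semantics only. It does not use the Sphinx API: set_filter and sort_attr_desc are hypothetical helpers, and the posts are made-up sample rows carrying the three attributes from the configuration fragment.

```python
# Hypothetical documents with the attributes declared in the configuration.
POSTS = [
    {"id": 1, "author_id": 123, "forum_id": 1, "post_date": 100},
    {"id": 2, "author_id": 456, "forum_id": 3, "post_date": 300},
    {"id": 3, "author_id": 123, "forum_id": 7, "post_date": 200},
    {"id": 4, "author_id": 123, "forum_id": 2, "post_date": 400},
]


def set_filter(docs, attr, values):
    """Keep only documents whose attribute equals one of the given values."""
    allowed = set(values)
    return [doc for doc in docs if doc[attr] in allowed]


def sort_attr_desc(docs, attr):
    """Order documents by one attribute, largest value first."""
    return sorted(docs, key=lambda doc: doc[attr], reverse=True)


def filtered_ids(docs):
    """Apply both filters (they combine with AND), then sort by date."""
    hits = set_filter(docs, "author_id", [123])
    hits = set_filter(hits, "forum_id", [1, 3, 7])
    return [doc["id"] for doc in sort_attr_desc(hits, "post_date")]
```

Because attributes are stored alongside the index instead of being full-text indexed, this kind of filtering and ordering is applied to the matches of the text query.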

Searching – matching modes

Matching modes
o SPH_MATCH_ALL
o SPH_MATCH_ANY
o SPH_MATCH_PHRASE
o SPH_MATCH_BOOLEAN
o SPH_MATCH_EXTENDED2
The most flexible mode: SPH_MATCH_EXTENDED2
hello | world
hello -world
@name hello @intro world
"hello world"
aaa << bbb << ccc
"hello world foo"~10
"the world is a wonderful place"/3
"hello world" @title "example program"~5 @body python -(php|perl) @* code

Searching – sort modes

• SPH_SORT_RELEVANCE
• SPH_SORT_EXTENDED
@weight DESC, price ASC, @id DESC

• SPH_SORT_EXPR
$cl->SetSortMode ( SPH_SORT_EXPR,
"@weight + ( user_karma + ln(pageviews) )*0.1" );

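Searching – distributed search

• Partition the data horizontally and index each partition separately
• Configure a distributed index on the master searchd
• The master searchd sends the request to each slave searchd, merges the
results they return, and returns the final result
• Every searchd in the cluster can act as the master, which allows load balancing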
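
The scatter-and-merge step described above can be sketched in a few lines of Python. This is an illustration under loud assumptions, not how searchd is implemented: search_shard is a hypothetical stand-in for one searchd instance, the weight is a plain term count rather than Sphinx's ranking, and the shards are queried sequentially instead of in parallel.

```python
import heapq


def search_shard(shard, term):
    """Stand-in for one searchd: return (weight, doc_id) for matching docs."""
    hits = []
    for doc_id, text in shard.items():
        weight = text.lower().split().count(term)  # toy weight: term count
        if weight:
            hits.append((weight, doc_id))
    return hits


def distributed_search(shards, term, limit=3):
    """Query every shard, merge the partial results, keep the best `limit`."""
    all_hits = []
    for shard in shards:
        all_hits.extend(search_shard(shard, term))
    # Highest weight first; ties are broken by the lower document id.
    return heapq.nsmallest(limit, all_hits, key=lambda hit: (-hit[0], hit[1]))


# Hypothetical horizontal partitions of one document collection.
SHARD_A = {1: "hello world", 2: "hello hello sphinx"}
SHARD_B = {3: "world of scrapy", 4: "hello hello hello"}
```

Document ids must be unique across partitions for the merge to be meaningful, which is why the data is split horizontally rather than duplicated.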

Searching – SphinxQL: searching with SQL syntax

• searchd implements the MySQL network protocol
• searchd can be used as if it were a MySQL server, connecting with a MySQL client

SELECT *, @weight*10+docboost AS skey FROM example ORDER BY ske
SELECT * FROM test1 WHERE MATCH('"test doc"/3')
SELECT * FROM test WHERE MATCH('@title hello @body world') OPTION
ranker=bm25, max_matches=3000

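SphinxSE: a MySQL storage engine

Characteristics
• Like InnoDB and MyISAM, it must be compiled into MySQL
• It stores no data itself; it talks to searchd to fetch results
Advantages
• Usable from any language, whereas the native API supports only a few languages
• More efficient when search results need further processing on the MySQL side
(JOIN, MySQL-like filtering)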

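Sphinx vs. Xapian

Sphinx
• searchd provides the search service
• No need to implement your own indexer or write C++ code; indexing and
search are set up through configuration alone
• Distributed search

Xapian
• Similar to Lucene: the API accesses the index files directly to search
• You have to implement the indexer yourself
• Highly customizable (Douban switched from Sphinx to Xapian)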

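Demo – a search engine built with Scrapy + Sphinx

The demo crawls, indexes, and searches novels from Qidian (起点) to build a novel search engine.

The demo code can be downloaded from GitHub: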

git clone git://github.com/pkufranky/sedemo-indexer.git
git clone git://github.com/pkufranky/sedemo-spider.git

• Crawler implemented with Scrapy
• Indexing and search implemented with Sphinx
• A search front end

For details, see http://pkufranky.heroku.com/2010/06/03/scrapysphinx/
