8. Built-in middlewares
• Downloader middlewares
o DefaultHeadersMiddleware
o HttpAuthMiddleware
o HttpCacheMiddleware
o RedirectMiddleware
o RetryMiddleware
• Spider middlewares
o DepthMiddleware
o RefererMiddleware
• Scheduler middlewares
o DuplicatesFilterMiddleware
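A downloader middleware is essentially a class with hooks such as `process_request(request, spider)`. The following is a minimal sketch in the spirit of DefaultHeadersMiddleware, not Scrapy's actual implementation; `FakeRequest` and `DefaultHeadersSketch` are hypothetical names used only so the example runs without Scrapy installed.

```python
class FakeRequest:
    """Hypothetical stand-in for a Scrapy request; carries only url and headers."""

    def __init__(self, url, headers=None):
        self.url = url
        self.headers = dict(headers or {})


class DefaultHeadersSketch:
    """Fills in default headers that the request does not already set."""

    def __init__(self, default_headers):
        self.default_headers = default_headers

    def process_request(self, request, spider):
        for name, value in self.default_headers.items():
            # Keep any header the spider set explicitly.
            request.headers.setdefault(name, value)
        # Returning None means "continue processing this request".
        return None


mw = DefaultHeadersSketch({"Accept-Language": "en", "User-Agent": "demo-bot"})
req = FakeRequest("http://example.com", headers={"User-Agent": "custom"})
mw.process_request(req, spider=None)
print(req.headers)
```

The explicit `User-Agent` survives, and only the missing `Accept-Language` header is added.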
9. Extensions
• Characteristics
o Ordinary classes loaded when Scrapy starts
o Listen for various signals (engine_started, item_scraped, item_dropped)
• Built-in extensions
o CoreStats
o WebConsole
o …
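The "plain class that listens for signals" pattern can be sketched roughly as follows. This is an illustration of the idea only: `TinySignals` and `ItemCounter` are hypothetical, signals are simplified to strings, and Scrapy's real signal manager and handler signatures are not reproduced here.

```python
from collections import defaultdict


class TinySignals:
    """Hypothetical signal manager: maps signal names to receiver callables."""

    def __init__(self):
        self._receivers = defaultdict(list)

    def connect(self, receiver, signal):
        self._receivers[signal].append(receiver)

    def send(self, signal, **kwargs):
        for receiver in self._receivers[signal]:
            receiver(**kwargs)


class ItemCounter:
    """An ordinary class that counts scraped and dropped items."""

    def __init__(self, signals):
        self.scraped = 0
        self.dropped = 0
        signals.connect(self.on_scraped, signal="item_scraped")
        signals.connect(self.on_dropped, signal="item_dropped")

    def on_scraped(self, item):
        self.scraped += 1

    def on_dropped(self, item):
        self.dropped += 1


signals = TinySignals()
counter = ItemCounter(signals)
signals.send("item_scraped", item={"title": "a"})
signals.send("item_scraped", item={"title": "b"})
signals.send("item_dropped", item={"title": "c"})
print(counter.scraped, counter.dropped)
```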
10. Extracting data from web pages
• CrawlSpider: Rule/Matcher/callback
• Extracting with XPath
• Scrapy shell
• Parsley: a selector language, a superset of XPath and CSS3 (memory leak issues)
li.main>a/@href
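The Parsley expression above selects the `href` of links directly under `li.main`. A rough equivalent using only the standard library's limited XPath support is sketched below; the sample markup is made up, and real pages usually need a forgiving HTML parser (or Scrapy's own selectors) rather than `xml.etree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
page = """
<html><body>
  <ul>
    <li class="main"><a href="/news">News</a></li>
    <li class="side"><a href="/ads">Ads</a></li>
    <li class="main"><a href="/docs">Docs</a></li>
  </ul>
</body></html>
""".strip()

root = ET.fromstring(page)
# ElementTree has no /@href step, so select the <a> elements
# and read the attribute from each one.
links = [a.get("href") for a in root.findall(".//li[@class='main']/a")]
print(links)
```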
12. Sphinx features
• high indexing speed (up to 10 MB/sec on modern CPUs);
• high search speed (avg query is under 0.1 sec on 2-4 GB text collections);
• high scalability (up to 100 GB of text, up to 100 M documents on a single
CPU);
• provides good relevance ranking through combination of phrase proximity
ranking and statistical (BM25) ranking;
• provides distributed searching capabilities;
• provides document excerpts generation;
• provides searching from within MySQL through pluggable storage engine;
• supports boolean, phrase, and word proximity queries;
• supports multiple full-text fields per document (up to 32 by default);
• supports multiple additional attributes per document (e.g. groups, timestamps);
• supports stopwords;
• supports both single-byte encodings and UTF-8;
• supports English stemming, Russian stemming, and Soundex for morphology;
• supports MySQL natively (MyISAM and InnoDB tables are both supported);
• supports PostgreSQL natively.
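14. Indexing
• Data sources: databases, XML, and so on.
o Each row of a table is treated as one document
o The configuration specifies which columns are indexed
• Attributes: some columns can be declared as document attributes; they are
not indexed, but can be used for filtering and sorting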
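To make the split between indexed columns and attributes concrete, here is a toy in-memory sketch. It is not how Sphinx stores data: the rows, column names, and the `search` helper are all hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

# Hypothetical rows, shaped like a SQL result set: one row = one document.
rows = [
    {"id": 1, "title": "hello world", "content": "first post",
     "author_id": 123, "post_date": 100},
    {"id": 2, "title": "hello again", "content": "second post",
     "author_id": 456, "post_date": 200},
    {"id": 3, "title": "goodbye", "content": "hello from the last post",
     "author_id": 123, "post_date": 300},
]
INDEXED_COLUMNS = ("title", "content")          # full-text fields
ATTRIBUTE_COLUMNS = ("author_id", "post_date")  # stored, not indexed

inverted = defaultdict(set)  # word -> ids of documents containing it
attributes = {}              # document id -> attribute values
for row in rows:
    for column in INDEXED_COLUMNS:
        for word in row[column].lower().split():
            inverted[word].add(row["id"])
    attributes[row["id"]] = {name: row[name] for name in ATTRIBUTE_COLUMNS}


def search(word, author_id=None):
    """Return matching document ids, optionally filtered by author, newest first."""
    ids = inverted.get(word.lower(), set())
    if author_id is not None:
        ids = {i for i in ids if attributes[i]["author_id"] == author_id}
    return sorted(ids, key=lambda i: attributes[i]["post_date"], reverse=True)


print(search("hello"))                 # all three documents, newest first
print(search("hello", author_id=123))  # only documents by author 123
```

Searching consults only the inverted index; the attribute values are touched only when filtering and sorting.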
15. Indexing (2)
Index configuration snippet
sql_query = SELECT id, title, content, \
    author_id, forum_id, post_date FROM my_forum_posts
sql_attr_uint = author_id
sql_attr_uint = forum_id
sql_attr_timestamp = post_date
Filtering and sorting example
// only search posts by author whose ID is 123
$cl->SetFilter ( "author_id", array ( 123 ) );
// only search posts in sub-forums 1, 3 and 7
$cl->SetFilter ( "forum_id", array ( 1,3,7 ) );
// sort found posts by posting date in descending order
$cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
16. Search – matching modes
Matching modes
o SPH_MATCH_ALL
o SPH_MATCH_ANY
o SPH_MATCH_PHRASE
o SPH_MATCH_BOOLEAN
o SPH_MATCH_EXTENDED2
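The difference between the first three modes can be illustrated with a few lines of plain Python. This is a simplified sketch of the semantics only (whitespace tokenization, no stemming, no ranking), not Sphinx code; the function names are hypothetical.

```python
def tokens(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def match_all(query, doc):
    """SPH_MATCH_ALL-like: every query word occurs somewhere in the document."""
    words = set(tokens(doc))
    return all(w in words for w in tokens(query))


def match_any(query, doc):
    """SPH_MATCH_ANY-like: at least one query word occurs in the document."""
    words = set(tokens(doc))
    return any(w in words for w in tokens(query))


def match_phrase(query, doc):
    """SPH_MATCH_PHRASE-like: query words occur consecutively, in order."""
    q, d = tokens(query), tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


doc = "the world says hello to python"
print(match_all("hello world", doc))     # both words are present
print(match_any("hello perl", doc))      # "hello" is present
print(match_phrase("hello world", doc))  # words are not adjacent in this order
print(match_phrase("says hello", doc))   # adjacent and in order
```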
The most flexible mode: SPH_MATCH_EXTENDED2
hello | world
hello | -world
@name hello @intro world
"hello world"
aaa << bbb << ccc
"hello world foo"~10
"the world is a wonderful place"/3
"hello world" @title "example program"~5 @body python -(php|perl) @* code
19. Search – SphinxQL: searching with SQL syntax
• searchd implements the MySQL network protocol
• searchd can be used like a MySQL server and accessed with a MySQL client
SELECT *, @weight*10+docboost AS skey FROM example ORDER BY ske
SELECT * FROM test1 WHERE MATCH('"test doc"/3')
SELECT * FROM test WHERE MATCH('@title hello @body world') OPTION
ranker=bm25, max_matches=3000