What's the story with open source?
Searching and monitoring news media with open source technology
Charlie Hull, Flax
BCS IRSG Search Solutions 2010
Photo source: http://www.flickr.com/photos/shironekoeuro/
3. www.flax.co.uk 3
What is Flax?
Search engine specialists
Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
Based in Cambridge UK
Contributors to and users of Xapian
Recently selected as UK Authorized Partner by Lucid Imagination
Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen
Apache Lucene and Solr are trademarks of The Apache Software Foundation
The challenges
Content is created for publication, not for search
Content isn't published consistently or available to all
Ranking is never simple
“We just want something like Google”
Every system will have to scale beyond its originally planned size
- Every project is different
So how do we build news search?
Indexing
Historical, daily & updates (i.e. later editions)
Must cope with high volume, quickly
Essential metadata – byline, title, source
File format translation not always necessary
BUT Pre-processing sometimes required
Content restriction & embargo data
Solution
Lightweight, customisable index scripts using powerful open source libraries
So how do we build news search?
from datetime import datetime

import xapian
import flax.core

# Create a new Xapian database on disk
db = xapian.WritableDatabase('db', xapian.DB_CREATE)

# Describe the fields, then save the field map to the database
fm = flax.core.Fieldmap()
fm.language = 'en'            # stem for English
fm.setfield('mytext', False)  # freetext field
fm.setfield('mydate', True)   # filter field
fm.save(db)

# Build one document, add it and flush the changes to disk
doc = fm.document()
doc.index('mytext', "I don't like spam.")
doc.index('mydate', datetime(2010, 2, 3, 12, 0))
fm.add_document(db, doc)
db.flush()
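The metadata and embargo points from the indexing slide can be handled in a small pre-processing step before any search library is involved. The following is a minimal sketch using only the Python standard library, not Flax's actual pipeline: the XML element names (`title`, `byline`, `source`, `body`, `embargo`) and the `parse_article` helper are hypothetical, chosen purely for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical feed format: the element names are assumptions for illustration.
SAMPLE = """<article>
  <title>Spam prices rise</title>
  <byline>A. Reporter</byline>
  <source>Daily Example</source>
  <embargo>2010-02-03T12:00:00</embargo>
  <body>I don't like spam.</body>
</article>"""

REQUIRED = ("title", "byline", "source", "body")


def parse_article(xml_text):
    """Turn one XML article into a flat dict of fields ready for indexing."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name in REQUIRED:
        text = (root.findtext(name) or "").strip()
        if not text:
            # Fail loudly: essential metadata must not be silently dropped.
            raise ValueError("missing required field: %s" % name)
        fields[name] = text
    embargo = (root.findtext("embargo") or "").strip()
    # A missing or empty embargo element means the article is available at once.
    fields["embargo"] = (
        datetime.strptime(embargo, "%Y-%m-%dT%H:%M:%S") if embargo else None
    )
    return fields


article = parse_article(SAMPLE)
```

The resulting dict could then be passed field by field to whatever indexing call is in use. Raising on missing metadata, rather than indexing a partial record, is one possible design choice; it makes bad input visible to the administrator.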
So how do we build news search?
Searching
Free text with Boolean operators
Filters for metadata & date ranges
Combine date and relevance ranking
Faceted search where appropriate
Saved searches & Alerting
'More like this'
Content restriction & embargo filters
Solution
Template-based user interface scripts, again using open source libraries
Beware JavaScript & older browsers!
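Two of the search requirements above, combining date with relevance and filtering embargoed content, can be illustrated in a few lines of Python. This is only a sketch under stated assumptions: the exponential half-life decay, the `combined_score` and `rank` helpers and the hit dictionaries are all hypothetical, and a real search library would normally apply such weighting inside the engine rather than after the search.

```python
from datetime import datetime, timedelta


def combined_score(relevance, published, now, half_life_days=7.0):
    """Scale a relevance score by an exponential recency decay."""
    age_days = max((now - published).total_seconds(), 0.0) / 86400.0
    return relevance * 0.5 ** (age_days / half_life_days)


def rank(hits, now, half_life_days=7.0):
    """Drop embargoed hits, then sort the rest by combined score, best first."""
    visible = [h for h in hits if h["embargo"] is None or h["embargo"] <= now]
    return sorted(
        visible,
        key=lambda h: combined_score(
            h["relevance"], h["published"], now, half_life_days
        ),
        reverse=True,
    )


now = datetime(2010, 2, 3, 12, 0)
hits = [
    {"id": "old-strong", "relevance": 0.9,
     "published": now - timedelta(days=14), "embargo": None},
    {"id": "new-weaker", "relevance": 0.5,
     "published": now - timedelta(hours=2), "embargo": None},
    {"id": "embargoed", "relevance": 1.0,
     "published": now, "embargo": now + timedelta(hours=3)},
]
ranked_ids = [h["id"] for h in rank(hits, now)]
```

With a seven-day half-life, the two-hour-old story outranks the stronger match from a fortnight ago, and the embargoed story is not returned at all. The half-life is the tuning knob: a longer value moves the ordering back towards pure relevance.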
So how do we build news search?
Administration
Indexing failures common
Logging is essential
Log to text as a first pass, reports later
Scalability
Content is always growing
Both indexing & searching must scale
Open source search libraries provide distributed indexing, replication, remote indexes
Not simple to get this right!
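The advice to log to text first can be followed with Python's standard `logging` module. The sketch below is an assumption about how such a script might look, not code from any Flax project: `make_text_logger`, `index_batch` and the `index_one` callback are hypothetical names. A failed document is logged with its traceback and the batch carries on.

```python
import logging


def make_text_logger(path):
    """Create a logger that appends one plain-text line per event to `path`."""
    logger = logging.getLogger("indexer")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate lines if called twice
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def index_batch(docs, index_one, logger):
    """Index every (doc_id, doc) pair in the list `docs`.

    Failures are logged and collected instead of aborting the batch.
    """
    failed = []
    for doc_id, doc in docs:
        try:
            index_one(doc)
        except Exception:
            # logger.exception records the message plus the full traceback.
            logger.exception("failed to index %s", doc_id)
            failed.append(doc_id)
    logger.info(
        "indexed %d documents, %d failures", len(docs) - len(failed), len(failed)
    )
    return failed
```

Because each event is a single timestamped line, the text log can later be parsed into reports without changing the indexing script.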
So how do we build news search?
Available open source technologies
Languages – C/C++, Java, Python, JavaScript
Search libraries – Xapian, Lucene
Search bindings/servers – Xappy, Flax.core, Solr
External libraries – pyparsing, CherryPy, xmllib, mxODBC, ...
Presentation & UI – HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ...
We can use whatever works!
Some examples
Newspaper Licensing Agency – NLA Clipshare
20 million newspaper stories
6500 users
Content from every major newspaper (and most regionals)
Used by journalists, clippings agencies, media monitors
Replacing internal systems at major newspapers
One of very few ways to search content from all the papers within hours of publication
http://www.nla-clipshare.com
Some examples
Financial Times – press cuttings
Web Service for easy integration
XML source data
Faceted search
Area filters (whole article, body, headline, byline or any combination)
Synonyms, spelling suggestions
Built from scratch in a fortnight
Designed as a prototype, scaled to production use without significant change
http://presscuttings.ft.com
A different task – news monitoring
Non-traditional use of search
Many automated searches on incoming content
Searches reflect complex client needs
False positives require human checking
False negatives should never occur!
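Monitoring inverts the usual search pattern: the queries are stored and each incoming article is tested against all of them. The sketch below shows the idea with plain Python sets. The profile format (`must`, `any`, `exclude` term lists) and the `route` helper are hypothetical simplifications; real client profiles are described above as far more complex.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(profile, tokens):
    """A profile matches if all `must` terms, at least one `any` term
    (when any are given) and none of the `exclude` terms are present."""
    must = set(profile.get("must", ()))
    any_of = set(profile.get("any", ()))
    exclude = set(profile.get("exclude", ()))
    return (
        must <= tokens
        and (not any_of or bool(any_of & tokens))
        and not (exclude & tokens)
    )


def route(article_text, profiles):
    """Return the names of all client profiles that the article satisfies."""
    tokens = tokenize(article_text)
    return sorted(name for name, p in profiles.items() if matches(p, tokens))


# Hypothetical client profiles, for illustration only.
profiles = {
    "acme-brand": {"must": ["acme"], "exclude": ["cartoon"]},
    "energy-sector": {"any": ["oil", "gas", "wind"]},
}
hits = route("Acme signs North Sea gas deal", profiles)
```

Testing every profile against every article is the simplest way to avoid missing a match, at the cost of work that grows with both the number of profiles and the article volume. Keeping the matching rules loose favours recall, which leaves false positives for the human checkers rather than producing false negatives.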
A different task – news monitoring
An example
Durrants Ltd.
Thousands of client search profiles
Hundreds of thousands of articles per day
Complex publication hierarchy
Established pipeline
Solution
Flexible query language allows OCR errors, punctuation, fuzzy matching, weighting
Supports features of previous engine
Scalable master-slave architecture
Accuracy improved in some cases from 95% rejected to 95% accepted
Hardware budget 15% of previous system
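To show why tolerance of OCR errors matters, the sketch below does fuzzy term matching with the standard library's `difflib`. It is not the query language built for Durrants, whose details are not given here; the `fuzzy_contains` helper, the 0.8 similarity threshold and the sample sentence are assumptions for illustration.

```python
import difflib
import re


def fuzzy_contains(term, text, threshold=0.8):
    """True if any word in `text` is at least `threshold` similar to `term`."""
    term = term.lower()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if difflib.SequenceMatcher(None, term, word).ratio() >= threshold:
            return True
    return False


# Typical OCR confusions: "m" read as "rn", "e" read as "c".
ocr_text = "The rninister spoke to the newspapcr yesterday"
found = fuzzy_contains("newspaper", ocr_text)
missed = fuzzy_contains("television", ocr_text)
```

An exact-match search for "newspaper" would miss this article, which is a false negative. Lowering the threshold catches more damaged words but also lets through more false positives for human checking.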
Why open source?
Flexible, extendable
Powerful & scalable
Lower cost
Commercial support available as necessary
- Freedom to innovate
Looking to the future
More and more content including social media
Multiple delivery platforms
Search-powered websites & applications
'No-SQL'
Cloud
Search no longer a bolt-on, but a platform for innovation
Open source no longer an outsider, but the obvious choice