Lecture given on 19 April 2012 at the Warsaw University of Technology. This is the 9th lecture of the regular master's-level course "Introduction to text mining".
2. SEARCH SYSTEM ARCHITECTURE & MEASURING SEARCH EFFECTIVENESS
Introduction to text mining – Warsaw University of Technology
3. Plan
Findwise – who we are, what we do.
General architecture of search engines
Data sources
Content processing
Search index
Query and result processing
Security in search engines
Applications based on search
Leading search technologies
The concept of Findability
Differences in online and enterprise search
Measuring search effectiveness
Questions and answers
4. Findwise – Search Driven Solutions
• Founded in 2005
• Offices in Sweden, Denmark, Norway and Poland
• 75+ employees
Our objective is to be a leading provider of Findability solutions utilising
the full potential of search technology to create customer business value.
• Paweł Wróblewski – search enthusiast
7. Data sources
Anything that holds information is a good source!
We need a connector to feed the data into a search system:
Take the content
Take the metadata
Take the security information
Different strategies to feed the data:
Push – an external application invokes the search system connector’s API to feed the content (e.g. transactional systems)
Pull – the connector periodically scans the source and takes the data (e.g. web crawler, file system)
Hybrid – external systems dump the data, which is then pulled by a connector
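The push and pull strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Record`, `SearchSystem`, `feed` and `PullConnector` are hypothetical names invented for this example, not the API of any real search platform.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """What a connector hands over: content, metadata, security information."""
    doc_id: str
    content: str
    metadata: dict
    allowed_groups: set


@dataclass
class SearchSystem:
    """Hypothetical stand-in for the feeding API of a search system."""
    received: dict = field(default_factory=dict)

    def feed(self, record: Record) -> None:
        self.received[record.doc_id] = record


class PullConnector:
    """Pull: the connector scans the source and feeds new or changed items."""

    def __init__(self, source: dict, system: SearchSystem):
        self.source = source   # doc_id -> (version, Record)
        self.system = system
        self.seen = {}         # doc_id -> version fed during an earlier scan

    def scan(self) -> int:
        fed = 0
        for doc_id, (version, record) in self.source.items():
            if self.seen.get(doc_id) != version:
                self.system.feed(record)
                self.seen[doc_id] = version
                fed += 1
        return fed


system = SearchSystem()

# Push: the external application calls the feeding API itself.
system.feed(Record("order-1", "Order placed", {"type": "order"}, {"sales"}))

# Pull: the connector is run periodically and picks up only the changes.
source = {"file-1": (1, Record("file-1", "Report", {"type": "file"}, {"staff"}))}
connector = PullConnector(source, system)
first_scan = connector.scan()
second_scan = connector.scan()
```

A hybrid setup would combine the two: the external system writes into `source`, and the connector picks the dump up on its next scan.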
8. Content Processing – the idea
[Diagram: document processing pipeline leading from Document to index. Stage labels: Format Conversion, Language Detection, Spell Checking, Lemmas (tenses, forms), Synonyms, Geography, Companies, Entities, People, Taxonomy Classification, Custom PLUG-IN, Vectorizer, Scopifier]
Sample document:
PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes. The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.
Input: byte stream
Output: structured document ready to be indexed
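The idea of turning a byte stream into a structured document through a chain of stages can be sketched as below. The stage functions and the tiny gazetteer `KNOWN_PEOPLE` are assumptions made for illustration. Real stages such as language detection or lemmatization are far more involved.

```python
import re

# Hypothetical gazetteer used by the entity stage below.
KNOWN_PEOPLE = ("Venus Williams", "Bianka Lamade")


def convert_format(doc):
    """Format conversion: decode the raw byte stream into text."""
    doc["text"] = doc.pop("raw").decode("utf-8")
    return doc


def tokenize(doc):
    """Split the text into lower-cased word tokens."""
    doc["tokens"] = re.findall(r"\w+", doc["text"].lower())
    return doc


def tag_people(doc):
    """Entity extraction by plain gazetteer lookup."""
    doc["people"] = [name for name in KNOWN_PEOPLE if name in doc["text"]]
    return doc


STAGES = (convert_format, tokenize, tag_people)


def process(raw, stages=STAGES):
    """Input: byte stream. Output: structured document ready to be indexed."""
    doc = {"raw": raw}
    for stage in stages:
        doc = stage(doc)
    return doc


doc = process(b"PARIS (Reuters) - Venus Williams raced into the second round")
```

Each stage only adds or rewrites fields, so stages can be reordered or replaced without touching the others.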
9. Content Processing – the implementation
Hydra is used to refine content before it hits the index. Every document fetched from a source runs through a targeted pipeline consisting of a number of stages. A stage can be thought of as an “app” in the App Store or the Android Market. Findwise has created a large number of such stages, each with one small purpose: to enhance the content of the item. Additional stages can be created to provide customer-specific functionality.
10. Hydra - example
Select the stages to use in the pipeline: the left column corresponds to the “market”, and the right column shows the stages in use.
11. Hydra - example
Modify the format of the date to only include year.
12. Hydra - example
The new year metadata can be used as a facet.
13. Hydra - example
Map every author field to a metadata field called author.
[Diagram: Pipeline A, Pipeline B]
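The three examples above (reducing a date to its year, using the year as a facet, and mapping author fields) could look roughly like this in Python. This sketch does not use Hydra's actual stage API. The function names, the `AUTHOR_FIELDS` tuple and the date format are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical source-specific field names that all mean "author".
AUTHOR_FIELDS = ("creator", "written_by", "dc_author")


def year_stage(doc, source="date", target="year", fmt="%Y-%m-%d"):
    """Keep only the year of a date field, stored in a new metadata field."""
    if source in doc:
        doc[target] = datetime.strptime(doc[source], fmt).year
    return doc


def author_stage(doc, candidates=AUTHOR_FIELDS, target="author"):
    """Map every author-like field to one field; the first one found wins."""
    for name in candidates:
        if name in doc:
            value = doc.pop(name)
            doc.setdefault(target, value)
    return doc


def facet_counts(docs, field_name):
    """Number of documents per value of a metadata field."""
    return Counter(d[field_name] for d in docs if field_name in d)


docs = [
    author_stage(year_stage({"date": "2012-04-19", "creator": "A"})),
    author_stage(year_stage({"date": "2012-01-02", "written_by": "B"})),
    author_stage(year_stage({"date": "2011-12-31", "dc_author": "A"})),
]
years = facet_counts(docs, "year")
authors = facet_counts(docs, "author")
```

After the two stages every document carries the same `year` and `author` fields regardless of its source, which is what makes a facet over them meaningful.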
15. Search index – the problem
Input: structured document (content + metadata)
Output: binary representation of an inverted index optimised for speed and accuracy
Search index has a flat structure – no internal relations
Changes to the index structure usually require an index rebuild (re-indexing)
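A toy in-memory version of this step, assuming whitespace tokenization and a dictionary in place of the optimised binary format, shows both the flat structure and why a structural change forces a rebuild. The function name `build_index` and the sample documents are illustrative only.

```python
from collections import defaultdict


def build_index(docs, fields):
    """Build a flat inverted index: (field, term) -> set of document ids."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field_name in fields:
            for term in str(doc.get(field_name, "")).lower().split():
                index[(field_name, term)].add(doc_id)
    return index


docs = {
    1: {"content": "LCD monitor", "author": "Anna"},
    2: {"content": "Plasma TV", "author": "Jan"},
}

# Index built with the original structure: only "content" is searchable.
index_v1 = build_index(docs, ["content"])

# Making "author" searchable changes the index structure,
# so every document has to be indexed again.
index_v2 = build_index(docs, ["content", "author"])
```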
16. Search index – the problem
Inverted index
Theory in previous lectures
How to achieve…
Petabytes of indexed data
Thousands of queries per second
Thousands of index updates per second
[Diagram: FAST Enterprise Search Platform – search cluster example. The search cluster is drawn as a grid of Indexing / Search nodes, Node 00 … Node NM, labelled “Index split” and “Index mirror”.]
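How index split and index mirror work together can be sketched as follows. The sketch assumes that a split spreads documents over partitions by hashing the document id, that every mirror of a partition holds the same documents, and that a query is sent to one mirror of every partition. The class `SearchCluster` and its methods are hypothetical and do not describe the FAST platform's actual behaviour.

```python
import zlib
from itertools import count


class SearchCluster:
    """Grid of nodes: grid[p][m] is mirror m of index partition p."""

    def __init__(self, partitions, mirrors):
        self.grid = [[{} for _ in range(mirrors)] for _ in range(partitions)]
        self._query_counter = count()

    def partition_of(self, doc_id):
        # Index split: each document belongs to exactly one partition.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.grid)

    def index(self, doc_id, text):
        # Index mirror: every mirror of that partition gets a copy.
        for node in self.grid[self.partition_of(doc_id)]:
            node[doc_id] = text.lower()

    def search(self, term):
        # Queries rotate over the mirrors and fan out to all partitions.
        mirror = next(self._query_counter) % len(self.grid[0])
        hits = []
        for partition in self.grid:
            node = partition[mirror]
            hits.extend(d for d, text in node.items()
                        if term.lower() in text.split())
        return sorted(hits)


cluster = SearchCluster(partitions=3, mirrors=2)
for i, text in enumerate(["LCD monitor", "Plasma TV", "TFT monitor", "LCD TV"]):
    cluster.index(f"doc{i}", text)
```

Under these assumptions, adding partitions spreads the data volume and the update load, while adding mirrors spreads the query load.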
17. Search index – the implementation
In order to perform updates (index rebuilds) efficiently, the index is divided into several partitions
[Diagram: four index partitions]
A small partition rebuilds quickly, unlike a big one
Rebuilding a larger partition involves merging in the index from the smaller one(s)
Rebuilds can be triggered by: number of rebuild operations, number of documents, percentage of total volume
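The partitioning scheme can be illustrated with a small simulation. It assumes a single trigger, the number of documents in a partition, and models each partition as a plain set of document ids. The class name and the limits are made up for the example.

```python
class PartitionedIndex:
    """Partitions ordered from smallest to largest; the last one is unbounded."""

    def __init__(self, limits):
        # limits[i]: documents allowed in partition i before it is merged
        # into partition i + 1.
        self.limits = limits
        self.partitions = [set() for _ in range(len(limits) + 1)]
        self.rebuilds = [0] * (len(limits) + 1)

    def add(self, doc_id):
        # New documents always land in the smallest partition.
        self.partitions[0].add(doc_id)
        self.rebuilds[0] += 1
        for i, limit in enumerate(self.limits):
            if len(self.partitions[i]) > limit:
                # Rebuilding the larger partition merges in the smaller one.
                self.partitions[i + 1] |= self.partitions[i]
                self.partitions[i] = set()
                self.rebuilds[i + 1] += 1


index = PartitionedIndex(limits=[2, 4])
for doc_id in range(10):
    index.add(doc_id)

sizes = [len(p) for p in index.partitions]
```

In this run the smallest partition is rebuilt on every update, while the largest one is rebuilt only once, which is the point of keeping several partitions.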
18. Query processing
Query: “Do you have an LCD monitor under $900?”
[Diagram: query processing pipeline. Stage labels: Tokenizer, Spell-checking, Phrasing, Anti-phrasing, Normalization, Lemmas, Synonyms, NLQ, Thesaurus, Geography, PLUG-IN, Adaptive Evaluation]
Examples shown in the diagram:
Anti-phrasing: “Do you have a”
NLQ: “Under $900?” becomes price < 900
Thesaurus: LCD monitors, TFT monitors, Flat TV, Plasma TV
PLUG-IN: YES! X = LCD monitor, BUY( X )
Modified query:
Use “Product” collection
Rank profile = “Profit margin”
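A few of these query processing steps can be sketched on the slide's own example query. The phrase list, the thesaurus entry and the output layout are assumptions made for illustration. A production pipeline would also tokenize, spell-check and lemmatize.

```python
import re

# Hypothetical resources for the example.
ANTI_PHRASES = ("do you have an", "do you have a")
THESAURUS = {"lcd monitor": ["tft monitor"]}


def process_query(raw):
    """Turn a natural-language question into terms plus a structured filter."""
    # Normalization: lower-case and drop the trailing question mark.
    text = raw.lower().strip().rstrip("?").strip()

    # NLQ: "under $900" becomes the filter price < 900.
    filters = {}
    match = re.search(r"under \$(\d+)", text)
    if match:
        filters["price_lt"] = int(match.group(1))
        text = (text[:match.start()] + text[match.end():]).strip()

    # Anti-phrasing: remove a leading phrase that carries no search intent.
    for phrase in ANTI_PHRASES:
        if text.startswith(phrase + " "):
            text = text[len(phrase):].strip()
            break

    # Thesaurus: expand what is left with its synonyms.
    terms = [text] + THESAURUS.get(text, [])
    return {"terms": terms, "filters": filters}


modified = process_query("Do you have an LCD monitor under $900?")
```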
19. Result processing
The following issues might apply to result processing:
Ranking generation
Factors that can be considered: number of hits, proximity of hits, freshness (date), web measures (e.g. PageRank), business and context factors (boosting or blocking)
Search federation
Integration of results from multiple search engines: round robin, normalized ranks, searchlets (multiple result lists presented in different ways)
Security trimming
Filtering out the results that do not match the user’s credentials
Last-second check
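Two of the steps above, round-robin federation and security trimming, are simple enough to sketch. The function names, the ACL layout (document id mapped to the groups allowed to read it) and the deny-by-default rule are assumptions of this example.

```python
from itertools import zip_longest


def round_robin(*result_lists):
    """Interleave ranked lists from several engines, dropping duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for hit in tier:
            if hit is not None and hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged


def security_trim(hits, acl, user_groups):
    """Keep only hits the user may see; hits without an ACL entry are dropped."""
    return [hit for hit in hits if acl.get(hit, set()) & user_groups]


federated = round_robin(["a", "b", "c"], ["x", "b"])
acl = {"a": {"staff"}, "x": {"hr"}, "b": {"staff", "hr"}}
visible = security_trim(federated, acl, user_groups={"staff"})
```

Trimming after federation, as done here, means the user never sees a hit from any engine that their credentials do not cover.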
20. Security in search solution
Search Application Security
Content-level Security
Secure Server Environment
22. Catalogue of Search Based Applications
Corporate Search
• Intranets/portals
• Information gateways
• Expertise location
• ECM repositories
• Collaboration
• Knowledge Management
• Enterprise apps
Intelligence System
• Market intelligence
• Customer intelligence
• Surveillance
• IP protection
• Fraud detection
• eDiscovery
• Quality Management
• Information risk management
Database Offloading
• Data warehouse
• Data transformation
• Data caches
Commerce Systems
• Search merchandising
• Customer analytics
• Campaign management
• Call centre enablement
• Customer self-service
Media Systems
• Public news syndication
• Multimedia search
• Proprietary research and publications
• Libraries
Search subsystem
Data connectors – out of the box, custom made
Repositories – Web, Databases, Files, Enterprise systems
23. Leading search engine technologies
• HP / Autonomy IDOL
• Microsoft (SharePoint and FAST Search products)
• Google Search Appliance (GSA)
• IBM Content Analytics/OmniFind
• Oracle Secure Enterprise Search/Endeca
• Apache Lucene/Solr (Open source)
• Exalead CloudView
• and more…
24. Comparison of different technology vendors
What is the goal of Enterprise Findability (EF)?
How should EF improve business?
What user groups are targeted?
What do the users want and need?
What information is available and where is it stored?
How should EF be rolled out and governed?
What costs are involved?
Are there any IT strategy considerations?
[Diagram: comparison criteria – core search technology, usability, vendor capabilities, total cost of ownership, connectivity and security]
Vendor mapping provides an answer to which EF platform best matches the overall requirements in the short and long term
25. Findability – what is it?
[Diagram: business value gained from search technology (Negligible to High) against use of search technology/platform (Basic to Advanced). Labels: SEARCH, <simple>, Search Technology, Business (needs & goals), Users (needs & capabilities), Information (quality & structure), Organisation (ownership & governance)]
Findability – a holistic approach to leveraging business value with search technology
26. Online vs. Enterprise Search
According to Stephen E. Arnold, „The New Landscape of Enterprise
Search”, Pandia, July 2011
27. Online vs. Enterprise Search
According to Stephen E. Arnold, „The New Landscape of Enterprise
Search”, Pandia, July 2011
28. Measuring the search effectiveness
Enterprise case
Relevance of search results is highly subjective
Search is tightly bound to the business; otherwise it is not important to consider
Increase income or reduce costs
Take into consideration all the dimensions of Findability:
Business: Needs & Goals
Users: Needs & Capabilities
Information: Quality & Structure
Organization: Ownership & Governance
Search Technology: correctness of implementation
Tools: reviews, workshops, presentations, strategy drafting, audits, etc.
29. Measuring the search effectiveness
Online case
Relevance of search results is highly subjective
Search is tightly bound to the business; otherwise it is not important to consider
Increase conversion rate
Verification of search functions and their impact on conversion rate
Run isolated tests for each identified feature
Create a score based on a weighted average
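A weighted average over per-feature test results can be computed as in the sketch below. The weights are four of those listed in the audit example of this lecture (scale 1-5). The per-test scores are invented for illustration and do not come from any real audit.

```python
def weighted_score(scores, weights):
    """Weighted average of the test scores; weights express importance."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("at least one test needs a non-zero weight")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Weights as in the audit example (scale 1-5).
weights = {
    "Keyword match": 5,
    "Wildcard expansion": 2,
    "Lemmatization": 5,
    "Spellchecking": 4,
}

# Hypothetical results of the isolated tests, on a 0-10 scale.
scores = {
    "Keyword match": 8,
    "Wildcard expansion": 2,
    "Lemmatization": 6,
    "Spellchecking": 5,
}

overall = weighted_score(scores, weights)
```

Because the wildcard test has a low weight, its poor result pulls the overall score down far less than a poor keyword-match result would.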
30. Measuring the search effectiveness
Online case
Audit – the Final Report
The results reported for each single test are composed of the two following elements:
Overall benchmark
Cumulated results for test groups
[Report page “Overall benchmark”, cut off on the left in the source. Legible fragments:]
… actively find and filter items in a map.
…g by … | 3 | It is a useful feature that aids in finding items closest to the selected position.
…ce …ased | 3 | Useful feature enabling mining the neighborhood of the selected item.
…stions, …h starting, Example | 4 | As important as first impression and encouraging users to interact with the service.
…esult page opportunity | 3 | It is important not to miss any category to offer another kind of search, content or advertisements.
…h …mance | 5 | Extremely important factor in online search solutions.
…mark score is presented in the following chart.
[Chart: Overall weighted scores, axis 0 to 6, series iFind, PKT, PF]
…alculation the overall benchmark can be expressed as cumulative weighted score …es 1-10. The ideal hypothetic search system should achieve score 10.
…re as follows for the conducted tests:
[Report page with test weights:]
Test categories designed for the purpose of the audit are generally applicable to any kind of search service or solution. Nevertheless, some of them are less and some are more important in a specific application such as an online Yellow Pages (YP) catalogue. That is why each test is assigned a weight that represents its importance and influence on the whole YP solution. The defined weights are described in the following table.
Test Name | Weight [1-5] | Remarks
I.a Keyword match | 5 | This is a basic feature of any full-text search system and it mostly influences the overall precision of search.
I.b Wildcard expansion | 2 | Users of YP solutions rarely use such features.
I.c Accuracy of result categories | 4 | The importance of properly assigned categories to registered entries is high since it influences usability and relevance of categories.
I.d Query operators | 1 | Users of YP solutions hardly ever use such features.
I.e Exact phrases | 3 | It might be important to catch an exact phrase in a search, preventing any background processing.
II.a Lemmatization | 5 | This is a must-have for any kind of search, especially for the Polish language.
II.b Synonym expansion | 3 | It is useful for improving recall of search, thus preventing zero results.
II.c Spellchecking | 4 | Very useful feature as people tend to make simple spelling mistakes while typing at a keyboard.
II.d Anti-phrasing | 3 | It is useful not to search for irrelevant and meaningless terms.
II.e Name and phrase recognition | 3 | It is useful to capture some multi-word expressions or names as a whole – in a single meaning.
II.f Natural Language Processing | 2 | Very advanced yet hard to implement feature.
III.a Navigation | 4 | Very useful feature enabling easy to use and intuitive