1. Measuring the quality of web search engines
Prof. Dr. Dirk Lewandowski
University of Applied Sciences Hamburg
dirk.lewandowski@haw-hamburg.de
Tartu University, 14 September 2009
2. Agenda
Introduction
A few words about user behaviour
Standard retrieval effectiveness tests vs. “Universal Search”
Selected results: Results descriptions, navigational queries
Towards an integrated test framework
Conclusions
6. Why measure the quality of web search engines?
• Search engines are the main access point to web content.
• One player is dominating the worldwide market.
• Open questions
– How good are search engines’ results?
– Do we need alternatives to the “big three” (or “big two”? “big one”?)
– How good are alternative search engines at delivering an alternative view of web content?
– How good must a new search engine be to compete?
7. A framework for measuring search engine quality
• Index quality
– Size of database, coverage of the web
– Coverage of certain areas (countries, languages)
– Index overlap
– Index freshness
• Quality of the results
– Retrieval effectiveness
– User satisfaction
– Results overlap
• Quality of the search features
– Features offered
– Operational reliability
• Search engine usability and user guidance
(Lewandowski & Höchstötter, 2007)
9. Agenda
Introduction
A few words about user behaviour
Standard retrieval effectiveness tests vs. “Universal Search”
Selected results: Results descriptions, navigational queries
Towards an integrated test framework
Conclusions
10. Users invest relatively few cognitive resources in web searching.
• Queries
– Average length: 1.7 words (German-language queries; English-language queries are slightly longer)
– Approx. 50 percent of queries consist of just one word
• Search engine results pages (SERPs)
– 80 percent of users view no more than the first results page (10 results)
– Users normally view only the first few results (“above the fold”)
– Users only view up to five results per session
– Session length is less than 15 minutes
• Users are usually satisfied with the results given.
12. Agenda
Introduction
A few words about user behaviour
Standard retrieval effectiveness tests vs. “Universal Search”
Selected results: Results descriptions, navigational queries
Towards an integrated test framework
Conclusions
13. Standard design for retrieval effectiveness tests
• Select (at least 50) queries (from log files, from user studies, etc.)
• Select some (major) search engines
• Consider top results (use cut-off)
• Anonymise search engines, randomise results positions
• Let users judge results
• Calculate precision scores
– the proportion of relevant results among all results retrieved up to the corresponding cut-off position
• Calculate/assume recall scores
– the proportion of all relevant results within the database that a given search engine returns (see the calculation sketch below)
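A minimal sketch of these two calculations in Python, assuming binary relevance judgements and a cut-off of 10; the judgement data is invented, and the total number of relevant documents would in practice have to be estimated, e.g. from the pooled results of all engines under test:

```python
def precision_at_k(judgements: list[bool], k: int) -> float:
    """Proportion of relevant results among the top k retrieved."""
    top = judgements[:k]
    return sum(top) / len(top) if top else 0.0

def recall_at_k(judgements: list[bool], k: int, total_relevant: int) -> float:
    """Proportion of all relevant documents that appear in the top k.
    For the web, total_relevant is unknown and must be estimated."""
    return sum(judgements[:k]) / total_relevant if total_relevant else 0.0

# One query, judged at a cut-off of 10 results (invented data)
judged = [True, True, False, True, False, False, True, False, False, False]
print(precision_at_k(judged, 10))                   # 0.4
print(recall_at_k(judged, 10, total_relevant=8))    # 0.5
```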
15. Standard design for retrieval effectiveness tests
• Problematic assumptions
– Model of “dedicated searcher” (willing to select one result after the other and go
through an extensive list of results)
– The user wants both high precision and high recall.
• These studies do not consider
– how many documents a user is willing to view / how many are sufficient for
answering the query
– how popular the queries used in the evaluation are
– graded relevance judgements (relevance scales; see the sketch after this list)
– different relevance judgements by different jurors
– different query types
– results descriptions
– users’ typical results selection behaviour
– visibility of different elements in the results lists (through their presentation)
– users’ preference for a certain search engine
– diversity of the results set / the top results
– ...
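One gap named above, graded relevance judgements, can be illustrated with a standard graded measure such as nDCG. A minimal sketch, assuming a 0–3 relevance scale; the choice of measure is illustrative, not taken from the talk:

```python
import math

def dcg(grades: list[int]) -> float:
    """Discounted cumulative gain: relevant results count more near the top."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg(grades: list[int]) -> float:
    """DCG normalised by the ideal (perfectly sorted) ranking."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Graded judgements for one results list (0 = irrelevant ... 3 = highly relevant)
print(round(ndcg([3, 0, 2, 1, 0]), 2))   # 0.93
```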
19. Agenda
Introduction
A few words about user behaviour
Standard retrieval effectiveness tests vs. “Universal Search”
Selected results: Results descriptions, navigational queries
Towards an integrated test framework
Conclusions
20. Results descriptions
• Sources from which search engines draw results descriptions:
– META description tag
– Yahoo Directory entry
– Open Directory (ODP) entry
28. Search engines deal with different query types.
Query types (Broder, 2002):
• Informational
– Looking for information on a certain topic
– User wants to view a few relevant pages
• Navigational
– Looking for a (known) homepage
– User wants to navigate to this homepage; there is only one relevant result
• Transactional
– Looking for a website to complete a transaction
– One or more relevant results
– Transaction can be purchasing a product, downloading a file, etc.
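A minimal sketch of how such a classification might be approximated with keyword rules; the patterns are illustrative assumptions only, as real classifiers draw on log and click data:

```python
import re

# Hypothetical keyword patterns for each query type
NAVIGATIONAL = re.compile(r"\bwww\.|\.(com|de|org)\b|\b(homepage|login)\b")
TRANSACTIONAL = re.compile(r"\b(buy|download|order|cheap)\b")

def broder_type(query: str) -> str:
    q = query.lower().strip()
    if NAVIGATIONAL.search(q):
        return "navigational"    # user wants one specific site
    if TRANSACTIONAL.search(q):
        return "transactional"   # user wants to complete a transaction
    return "informational"       # default: pages on a topic

for q in ("deutsche telekom homepage", "download firefox", "climate change causes"):
    print(q, "->", broder_type(q))
```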
32. Results for navigational vs. informational queries
• Studies should consider informational as well as navigational queries.
• Queries should be weighted according to their frequency (see the sketch below).
• Given that >40% of queries are navigational, new search engines should put significant effort into answering these queries well.
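A minimal sketch of frequency weighting, with invented per-query effectiveness scores and log frequencies; note how a single popular navigational query dominates the weighted average:

```python
def weighted_mean(scores: dict[str, float], freq: dict[str, int]) -> float:
    """Average effectiveness, with popular queries counting proportionally more."""
    total = sum(freq[q] for q in scores)
    return sum(scores[q] * freq[q] for q in scores) / total

scores = {"facebook": 1.0, "cheap flights": 0.6, "quantum physics": 0.3}  # invented
freq = {"facebook": 90_000, "cheap flights": 9_000, "quantum physics": 1_000}
print(round(weighted_mean(scores, freq), 3))   # 0.957, dominated by "facebook"
```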
33. Agenda
Introduction
A few words about user behaviour
Standard retrieval effectiveness tests vs. “Universal Search”
Selected results: Results descriptions, navigational queries
Towards an integrated test framework
Conclusions
34. Addressing major problems with retrieval effectiveness tests
• We use both navigational and informational queries.
– There is no suitable framework for transactional queries, though.
• We use query frequency data from the T-Online database.
– The database consists of approx. 400 million queries from 2007 onwards.
– We can use time series analysis (see the sketch below).
• We classify queries according to query type and topic.
– We did a study on query classification based on 50,000 queries from T-Online log
files to gain a better understanding of user intents. Data collection was
“crowdsourced” to Humangrid GmbH.
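A minimal sketch of the kind of time-series analysis such a log enables, assuming the queries are available as a CSV export; the file name and column names are hypothetical:

```python
import pandas as pd

# Hypothetical export: one row per query, with a timestamp column
log = pd.read_csv("queries.csv", parse_dates=["timestamp"])

# Total query volume per month (time series)
monthly_volume = log.set_index("timestamp").resample("M").size()

# Most frequent query strings: candidates for evaluation queries
top_queries = log["query"].value_counts().head(10)

print(monthly_volume.tail())
print(top_queries)
```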
35. Addressing major problems with retrieval effectiveness tests
• We consider all elements on the first results page.
– Organic results, ads, shortcuts
– We will use clickthrough data from T-Online to measure the “importance” of certain results.
• Each result will be judged by several jurors (see the pooling sketch after this list).
– Juror groups: Students, professors, retired persons, librarians, school children,
other.
– Additional judgements by the “general users” are collected in cooperation with
Humangrid GmbH.
• Results will be graded on a relevance scale.
– Both the results and their descriptions will be judged.
• We will classify all organic results according to
– document type (e.g., encyclopaedia, blog, forum, news)
– date
– degree of commercial intent
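A minimal sketch of pooling several jurors’ graded judgements for one result, assuming a 0–3 scale; taking the mean plus a simple disagreement measure is an illustrative choice, not the project’s published method:

```python
from statistics import mean, stdev

def pooled_judgement(grades: list[int]) -> tuple[float, float]:
    """Mean grade across jurors, plus their disagreement (standard deviation)."""
    disagreement = stdev(grades) if len(grades) > 1 else 0.0
    return mean(grades), disagreement

# One result, one juror per group (0-3 scale, invented values)
grades_by_group = {"student": 3, "professor": 2, "librarian": 3, "school child": 1}
avg, spread = pooled_judgement(list(grades_by_group.values()))
print(f"mean grade {avg:.2f}, disagreement {spread:.2f}")   # 2.25, 0.96
```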
36. Addressing major problems with retrieval effectiveness tests
• We will count ads on results pages
– Do search engines prefer pages carrying ads from the engine’s ad system?
• We will ask users additional questions
– Users will also judge the results set of each individual search engine as a whole.
– Users will rank the search engines based on their results sets.
– Users will indicate where they would have stopped viewing results.
– Users will provide their own relevance-ranked list by card-sorting the complete results set from all search engines (see the comparison sketch below).
• We will use printed screenshots of the results pages.
– This makes the study “mobile”.
– This is especially important for certain user groups (e.g., elderly people).
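A minimal sketch of comparing a user’s card-sorted ordering against an engine’s original ranking with Kendall’s tau; the result identifiers and orderings are invented:

```python
from itertools import combinations

def kendall_tau(order_a: list[str], order_b: list[str]) -> float:
    """+1 if the rankings agree on every pair of results, -1 if on none."""
    pos_a = {r: i for i, r in enumerate(order_a)}
    pos_b = {r: i for i, r in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

engine_rank = ["r1", "r2", "r3", "r4", "r5"]   # as presented by the engine
user_rank = ["r2", "r1", "r3", "r5", "r4"]     # as card-sorted by the user
print(kendall_tau(engine_rank, user_rank))     # 0.6: largely similar orderings
```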
37. State of current work
• First wave of data collection starting in October.
• Proposal for additional project funding sent to DFG (German Research
Foundation).
• The project on user intents from search queries is nearing completion.
• Continuing collaboration with Deutsche Telekom, T-Online.
38. Agenda
Introduction
A few words about user behaviour
Standard retrieval effectiveness tests vs. “Universal Search”
Selected results: Results descriptions, navigational queries
Towards an integrated test framework
Conclusions
39. Conclusion
• Measuring search engine quality is a complex task.
• Retrieval effectiveness is a major aspect of SE quality evaluation.
• Established evaluation frameworks are not sufficient for the web context.
40. Thank you for your attention.
Prof. Dr. Dirk Lewandowski
Hamburg University of Applied Sciences
Department Information
Berliner Tor 5
D - 20099 Hamburg
Germany
www.bui.haw-hamburg.de/lewandowski.html
E-Mail: dirk.lewandowski@haw-hamburg.de