Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
1. Challenges in Distributed Information Retrieval
Ricardo Baeza-Yates (1, 2)
Joint work with: C. Castillo (1), F. Junqueira (1), V. Plachouras (1) and F. Silvestri (3)
Outline: Crawling, Indexing, Query Processing, Caching
1. Yahoo! Research Barcelona – Catalunya, Spain
2. Yahoo! Research Latin America – Santiago, Chile
3. ISTI-CNR – Pisa, Italy
6. Crawling
In theory it is simple: fetch, parse, fetch, parse, ...
In practice it is difficult: it implies using other people’s resources (web servers’ CPU and network)
10. Issues
How to partition the crawling task?
What to do when one agent fails?
How to communicate among agents?
How to deal with external factors?
12. Partitioning
Host-based partitioning exploits locality of links
Balance improves if large and small hosts are treated differently
Performance improves if geographic location is considered
Consistent hashing allows agents to be added to and removed from the pool [Boldi et al., 2004]
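As an illustration of the last point, the following is a minimal sketch of host-based partitioning with consistent hashing. It is not the implementation of [Boldi et al., 2004]; the class name `CrawlerRing`, the use of MD5 to spread points, and the `replicas` value are all assumptions made for the example.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for spreading)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class CrawlerRing:
    """Hypothetical consistent-hash ring assigning hosts to crawling agents."""

    def __init__(self, agents, replicas=64):
        self.replicas = replicas   # virtual points per agent (assumed value)
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> agent name
        for agent in agents:
            self.add_agent(agent)

    def add_agent(self, agent):
        """Place `replicas` virtual points for the agent on the ring."""
        for i in range(self.replicas):
            p = _point(f"{agent}#{i}")
            bisect.insort(self._points, p)
            self._owner[p] = agent

    def remove_agent(self, agent):
        """Drop the agent's points; other agents' points stay where they are."""
        self._points = [p for p in self._points if self._owner[p] != agent]
        self._owner = {p: a for p, a in self._owner.items() if a != agent}

    def agent_for(self, host):
        """Return the agent owning the first ring point after the host's hash."""
        if not self._points:
            raise ValueError("no agents in the pool")
        i = bisect.bisect_right(self._points, _point(host)) % len(self._points)
        return self._owner[self._points[i]]
```

Because all URLs of a host hash to the same agent, links within a host stay local, and removing an agent only reassigns the hosts that agent owned.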
13. Communication
Host-based partitioning reduces communication
Highly-linked URLs should be cached
Communication with the server can be improved if the server cooperates
14. External factors
DNS can be a bottleneck
Varying quality of HTTP implementations
Varying quality of HTML coding
Varying quality of service in general
Spam
19. What’s Indexing
Indexing, in databases and IR, is the process of building an index over a collection of documents
Inverted indexes are typically used in IR
Lexicon: contains the distinct terms appearing in the collection’s documents
Posting lists: contain descriptions of the occurrences of each term within the corresponding documents
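A minimal sketch of these two components, assuming whitespace tokenization, lowercasing, and posting lists that store word positions per document (all simplifications chosen for the example; the function name `build_inverted_index` is hypothetical):

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build a lexicon and posting lists from {doc_id: text}.

    Returns (lexicon, postings) where lexicon is the sorted list of distinct
    terms and postings maps term -> {doc_id: [positions of the term]}.
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)
    lexicon = sorted(postings)
    return lexicon, dict(postings)


docs = {1: "web search engine", 2: "web crawler"}
lexicon, postings = build_inverted_index(docs)
```

Here `postings["web"]` lists both documents, while `postings["crawler"]` lists only document 2.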
20. Index and Distributed Indexing
[Figure: the T × D term-document matrix, with terms T1, T2, ..., Tn and documents D1, D2, ..., Dm, showing a term partition and a document partition.]
28. Document Partitioning
Split the collection into several sub-collections and index each one of them separately (corresponding to vertically slicing the T × D matrix)
pros:
  higher throughput
  new documents are easily added to existing indexes
  load balanced
cons:
  high number of disk operations
  high volume of data read from disk
35. Term Partitioning
Split the terms of the lexicon (and the corresponding inverted lists) among search systems (corresponding to horizontally slicing the T × D matrix)
pros:
  reduced number of disk accesses
  reduced volume of exchanged data
cons:
  requires the entire index to be built before slicing it into partitions
  not scalable with large collections
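The two ways of slicing the T × D matrix can be sketched on a toy inverted index. This is only an illustration: the function names are hypothetical, documents are assumed to have integer ids, and both assignments use a simple modulo rule rather than any strategy from the literature.

```python
def partition_by_document(postings, num_parts):
    """Vertical slices: each part holds all terms, but only for its own documents."""
    parts = [{} for _ in range(num_parts)]
    for term, plist in postings.items():
        for doc_id, positions in plist.items():
            parts[doc_id % num_parts].setdefault(term, {})[doc_id] = positions
    return parts


def partition_by_term(postings, num_parts):
    """Horizontal slices: each part holds complete posting lists for its own terms."""
    parts = [{} for _ in range(num_parts)]
    for i, term in enumerate(sorted(postings)):
        parts[i % num_parts][term] = postings[term]
    return parts


def parts_holding(parts, query_terms):
    """Indices of the partitions that hold postings for at least one query term."""
    return [i for i, part in enumerate(parts) if any(t in part for t in query_terms)]


postings = {"web": {0: [0], 1: [0]}, "search": {0: [1]}, "crawler": {1: [1]}}
doc_parts = partition_by_document(postings, 2)
term_parts = partition_by_term(postings, 2)
```

With this toy data, a query for "web" touches both document partitions but only one term partition, which is the intuition behind the difference in disk accesses between the two schemes.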
38. Partitioning Goals
Partitioning is the first design issue to be faced in distributed indexing
A distributed index should allow for efficient query routing and resolution
Reducing the number of nodes queried is also desirable
44. Partitioning Techniques
random partitioning
  documents are assigned uniformly at random (u.a.r.) to the various partitions
topical organization using clustering (e.g. k-means [Larkey et al., 2000, Liu and Croft, 2004])
  documents are first clustered, and each partition is then composed of one or more clusters
usage-induced partitioning (e.g. Query-Vector Document Model [Puppin et al., 2006])
  clustering is induced by the way users interact with the index
45. Load Balancing Issues
In document-partitioned indexes that do not adopt collection selection strategies, load is almost balanced among all the query processors
In term-partitioned indexes (even with the new pipelined scheme [Webber et al., 2006]), load balancing is an issue
In federated document-partitioned systems where collection selection is applied, load balancing is still an unexplored issue
[Figure: load percentage (y-axis, 0 to 100) against 1 to 8 (x-axis), in two panels: Document-distributed and Pipelined.]
50. Exploiting Usage Information
Query logs contain features that are critical for optimizing the efficiency of different parts of search engines
  query distribution
  query arrival time
  clickthrough information
  ...
53. Usage Information in Term Partitioned Systems
The frequency of query terms can be exploited to partition a collection with the aim of balancing the load of query processors
  bin-packing approach [Moffat et al., 2006]
  data mining approach [Lucchese et al., 2007]
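To make the bin-packing idea concrete, here is a generic greedy sketch (largest load first, always onto the least loaded server). It is not claimed to be the algorithm of [Moffat et al., 2006]; the function name `assign_terms` and the per-term load estimates are assumptions, for example query-log frequency multiplied by posting-list length.

```python
import heapq


def assign_terms(term_load, num_servers):
    """Greedily assign terms to servers, heaviest term first.

    term_load: term -> estimated processing load (an assumed estimate).
    Returns term -> server index.
    """
    heap = [(0.0, s) for s in range(num_servers)]  # (current load, server)
    heapq.heapify(heap)
    assignment = {}
    # Sort by decreasing load; ties are broken by term for determinism.
    for term, load in sorted(term_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, server = heapq.heappop(heap)        # least loaded server so far
        assignment[term] = server
        heapq.heappush(heap, (total + load, server))
    return assignment


assignment = assign_terms({"a": 5, "b": 4, "c": 3, "d": 2}, num_servers=2)
```

In this toy case both servers end up with a total load of 7, whereas an assignment that ignores query frequencies could easily place the two heaviest terms on the same server.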
56. Usage Information in Document Partitioned Systems
Random partitioning does not ensure load balancing [Badue et al., 2006]
Broadcast-based systems perform unnecessary operations on sub-collections containing few or no relevant documents
Usage-based mapping can be adopted to partition the collection into sub-collections that can be effectively discriminated when a query is received [Puppin et al., 2006]
58. Challenges in Distributed Indexing
In document-partitioned systems, partitioning strategies are needed that enhance the effectiveness of collection selection
In both kinds of system, finding effective load balancing strategies is a challenge
59. Query processing
System components:
  Clients submitting queries
  Sites consisting of servers
  Servers are commodity computers
Query processing:
  System receives a query
  Query routing: forwarding query to appropriate sites
  Merging results
Challenges:
  Determine appropriate sites on the fly
  WAN communication is costly
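The merging step can be sketched as follows, under the assumption that every contacted site returns `(score, doc_id)` pairs whose scores are directly comparable across sites (in practice this requires shared collection statistics or score normalization). The function name `merge_results` is hypothetical.

```python
import heapq


def merge_results(per_site_results, k):
    """Merge ranked lists from several sites into a global top-k.

    per_site_results: list of lists of (score, doc_id) pairs, one list per site.
    Returns the k highest-scoring pairs in decreasing score order.
    """
    all_hits = (hit for hits in per_site_results for hit in hits)
    return heapq.nlargest(k, all_hits)


site_a = [(0.9, "a1"), (0.4, "a2")]
site_b = [(0.7, "b1")]
top2 = merge_results([site_a, site_b], k=2)
```

Each site only needs to ship its own top-k results, which keeps the volume of data crossing the WAN small.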
60. Challenges in more detail
Large-scale systems:
  Large amount of data
  Large data structures
  Large number of clients and servers
Partitioning of data structures:
  Necessary due to very large data structures
  Parallel processing
  e.g. document collection split by topic, language, region
Replication of data structures:
  For availability, throughput, and response time
  Conflicts with resource utilization
61. Framework for Distributed Query Processing
[Figure: a client connected over a WAN to Site A (Region X), Site B (Region Y) and Site C (Region Z), with labels 1, 2 and 3.]
Query processor: matches documents to the received queries
Coordinator: receives queries and routes them to appropriate sites
Cache: stores results from previous queries
62. Currently...
Multiple sites
Sites are full replicas of each other
Simple query routing: Dynamic DNS
According to the previous framework, there is an opportunity to
  Use storage resources more efficiently
  Use more sophisticated query routing mechanisms
  Apply effective partitioning strategies (e.g., language-based strategies)
63. Partitioning
Goals:
  Achieve cost-effective scalability
  Reduce response times
Potential solutions:
  Partition large data structures by topic, language, etc.
  Effective query routing: first to local sites, then to global sites
  Incremental presentation of results to alleviate network latencies
64. Dependability
Goals:
  Availability of query processors
  Consistency of replicated query data (can be weak)
  Consistency of user state: e.g., personalization, user preferences
Potential solutions:
  More network resources: multi-homed sites
  Replication: within and across sites
  Consistency: techniques for weak consistency (replicas eventually converge)
  Caching: improves availability when query processors are unavailable
65. Dependability
Achieving availability is not straightforward
BIRN system studied by Junqueira and Marzullo [Junqueira and Marzullo, 2005]
Partitions are quite frequent
[Figure: average number of sites (y-axis, 0 to 12) per monthly availability level (x-axis: < 100, < 99.8, < 99, < 98, < 97).]
66. Communication
Message latency:
  Communication is costly in wide-area networks
  Latency is not negligible
  Server capacity drops as the latency to process a query increases
Potential solutions:
  Reduce as much as possible the number of sites contacted to process a query
  Have most queries processed by sites that are close in terms of network distance
67. Caching query results or postings [Baeza-Yates et al., 2007]
Caching query answers:
  44% of queries are singletons (appear only once)
  88% of the unique queries are singletons
  An infinite cache would achieve a 56% hit ratio
Caching postings of terms:
  4% of terms are singletons
  73% of the unique terms (the vocabulary) are singletons
  An infinite cache would achieve a 96% hit ratio
Note: All statistics and graphs on caching refer to a one-year query log from yahoo.co.uk
68. Static or dynamic caching of postings
Static caching of postings (Qtf):
  Cache terms with the highest query log frequency $f_q(t)$
However, there is a tradeoff between $f_q(t)$ and $f_d(t)$:
  Terms with high query log frequency $f_q(t)$ are good for the cache
  Terms with high document frequency $f_d(t)$ occupy too much space
Static caching of postings as a knapsack problem (QtfDf):
  Cache posting lists of terms with the highest ratio $f_q(t) / f_d(t)$
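The difference between the two selection rules can be sketched as a greedy fill of a fixed cache budget. This is an illustration only, not the evaluation code behind the slides: the function name `select_static_cache` is hypothetical, the space taken by a term is assumed to be proportional to its document frequency, and terms that do not fit are skipped rather than stopping the fill.

```python
def select_static_cache(fq, fd, capacity, use_ratio=True):
    """Greedily pick terms whose posting lists fit in `capacity` postings.

    fq: term -> query-log frequency f_q(t)
    fd: term -> document frequency f_d(t), used here as the posting-list size
    use_ratio=True ranks by f_q(t)/f_d(t) (QtfDf); False ranks by f_q(t) (Qtf).
    """
    def score(t):
        return fq[t] / fd[t] if use_ratio else fq[t]

    chosen, used = [], 0
    # Highest score first; ties broken by term so the result is deterministic.
    for t in sorted(fq, key=lambda t: (-score(t), t)):
        if used + fd[t] <= capacity:   # skip terms that no longer fit
            chosen.append(t)
            used += fd[t]
    return chosen


# Made-up frequencies, for illustration only.
fq = {"the": 100, "ir": 60, "icde": 50}
fd = {"the": 1000, "ir": 100, "icde": 5}
qtfdf_terms = select_static_cache(fq, fd, capacity=1000)
qtf_terms = select_static_cache(fq, fd, capacity=1000, use_ratio=False)
```

With these made-up numbers, Qtf spends the whole budget on one frequent but very long posting list, while QtfDf caches two shorter lists that together cover more query-term occurrences.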
70. Analysis of static caching
Trade-offs between caching postings and answers:
  Caching postings results in more hits
  Caching answers is faster
  To compare them, time and space parameters need to be considered
Problem: Given a fixed amount of memory and the average response times for a system, how much to allocate for caching answers and how much for caching postings?
71. Analysis of static caching
Scenario 1: Centralized retrieval system, complete/partial query evaluation, compressed/uncompressed postings
  The postings cache can answer more queries than the answers cache
  Most of the available memory should go to caching postings
Scenario 2: WAN distributed system, complete/partial query evaluation, compressed/uncompressed postings
  Network time dominates
  Most of the available memory should go to caching answers
Query dynamics:
  Slowly changing query dynamics make static caching viable
72. References
Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and Ziviani, N. (2006). Analyzing imbalance among homogeneous index servers in a web search system. Information Processing & Management.
Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Silvestri, F., and Plachouras, V. (2007). The impact of caching on search engines. In Proceedings of the International ACM SIGIR Conference (to appear), Amsterdam, Netherlands.
Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004). Ubicrawler: a scalable fully distributed web crawler. Software, Practice and Experience, 34(8):711–726.
Junqueira, F. and Marzullo, K. (2005). Coterie availability in sites. In Proceedings of the International Conference on Distributed Computing (DISC), number 3724 in LNCS, pages 3–17, Krakow, Poland. Springer Verlag.
Larkey, L. S., Connell, M. E., and Callan, J. (2000). Collection selection and results merging with topically organized U.S. patents and TREC data. In CIKM ’00: Proceedings of the ninth international conference on Information and knowledge management, pages 282–289, New York, NY, USA. ACM Press.
Liu, X. and Croft, W. B. (2004). Cluster-based retrieval using language models. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193, New York, NY, USA. ACM Press.
Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. (2007). Mining query logs to optimize index partitioning in parallel web search engines. To appear in Proceedings of the 2nd International Conference on Scalable Information Systems (INFOSCALE 2007).
Moffat, A., Webber, W., and Zobel, J. (2006). Load balancing for term-distributed parallel retrieval. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 348–355, New York, NY, USA. ACM Press.
Puppin, D., Silvestri, F., and Laforenza, D. (2006). Query-driven document partitioning and collection selection. In InfoScale ’06: Proceedings of the 1st international conference on Scalable information systems, page 34, New York, NY, USA. ACM Press.
Webber, W., Moffat, A., Zobel, J., and Baeza-Yates, R. (2006). A pipelined architecture for distributed text query evaluation. Information Retrieval. Published online October 5, 2006.