Challenges in Distributed Information Retrieval

Ricardo Baeza-Yates (1, 2)
Joint work with: C. Castillo (1), F. Junqueira (1), V. Plachouras (1) and F. Silvestri (3)

1. Yahoo! Research Barcelona – Catalunya, Spain
2. Yahoo! Research Latin America – Santiago, Chile
3. ISTI-CNR – Pisa, Italy

1 Crawling
2 Indexing
3 Query Processing
4 Caching

Main Modules and Issues

            Partition             Dependability   Communication (sync.)   External factors
Crawling    URL assignment        Re-crawl        URL exchanges           Web growth, Content change,
                                                                          Network topology, Bandwidth,
                                                                          DNS, QoS of servers
Indexing    Doc. partition,       Re-index        Partial indexing,       Web growth, Content change,
            Term partition                        updating, merging       Global statistics
Querying    Query routing,        Replication,    Rank aggregation,       Changing user needs,
            Collection selection, caching         Personalization         User base growth, DNS
            Load balancing

Crawling
    In theory it is simple: fetch, parse, fetch, parse, ...
    In practice it is difficult: it implies using other people’s resources (web servers’ CPU and network)

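The fetch/parse loop can be sketched as follows. This is only a minimal illustration: the in-memory `FAKE_WEB` dictionary and the function names `fetch` and `crawl` are hypothetical stand-ins, used so that the example touches no real web server.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
# A real crawler would fetch over HTTP and parse HTML instead.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def fetch(url):
    """Stand-in for an HTTP GET; returns the page's links, or None if unknown."""
    return FAKE_WEB.get(url)


def crawl(seed, max_pages=100):
    """Breadth-first fetch/parse loop starting from one seed URL."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        links = fetch(url)              # fetch
        if links is None:
            continue
        order.append(url)
        for link in links:              # parse: collect URLs not seen before
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

What the sketch leaves out is exactly what makes crawling hard in practice: per-host rate limiting, DNS resolution, and handling of slow or failing servers.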
Issues
    How to partition the crawling task?
    What to do when one agent fails?
    How to communicate among agents?
    How to deal with external factors?

Partitioning
    Host-based partitioning exploits locality of links
    Balance improves if large/small hosts are treated differently
    Performance improves if geographic location is considered

    Consistent hashing
        Allows agents to be added to and removed from the pool [Boldi et al., 2004]

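Host-based partitioning with consistent hashing could look roughly like the sketch below. It is not the implementation of [Boldi et al., 2004]; the class name `AgentRing`, the `replicas` parameter and the choice of SHA-1 are assumptions made for illustration, and URLs are assumed to be absolute.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _h(key):
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


class AgentRing:
    """Consistent-hash ring mapping hosts to crawling agents (illustrative)."""

    def __init__(self, agents, replicas=64):
        self.replicas = replicas
        self._ring = []                 # sorted list of (point, agent)
        for agent in agents:
            self.add(agent)

    def add(self, agent):
        # Each agent owns several points on the ring to even out the load.
        for i in range(self.replicas):
            bisect.insort(self._ring, (_h(f"{agent}#{i}"), agent))

    def remove(self, agent):
        self._ring = [(p, a) for p, a in self._ring if a != agent]

    def agent_for(self, url):
        # Host-based: every URL of a host is assigned to the same agent.
        host = urlsplit(url).hostname
        i = bisect.bisect(self._ring, (_h(host), ""))
        return self._ring[i % len(self._ring)][1]
```

When an agent is removed, only the hosts it owned move to other agents; all other assignments stay where they were.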
Communication
    Host-based partitioning reduces communication
    Highly-linked URLs should be cached
    Communication with the server can be improved if the server cooperates

External factors
    DNS can be a bottleneck
    Varying quality of implementation of HTTP
    Varying quality of HTML coding
    Varying quality of service in general
    SPAM

What’s Indexing
    Indexing in databases and IR is the process of building an index over a collection of documents
    Inverted indexes are typically used in IR
        Lexicon: contains the distinct terms appearing in the collection’s documents
        Posting lists: contain descriptions of the occurrences of the respective terms within the corresponding documents

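A toy inverted index makes the two components concrete. This is a minimal sketch, assuming whitespace tokenization and positional postings; the function name `build_inverted_index` and the sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict doc_id -> text. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


docs = {
    1: "web search engines crawl the web",
    2: "search engines index documents",
}
index = build_inverted_index(docs)
lexicon = sorted(index)       # the distinct terms of the collection
postings = index["web"]       # posting list of one term: doc_id -> positions
```

Here the keys of `index` play the role of the lexicon, and each value is the posting list of the corresponding term.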
Index and Distributed Indexing
    [Figure: the T × D (term × document) matrix. Term partition splits it into T1, T2, ..., Tn; document partition splits it into D1, D2, ..., Dm.]

Document Partitioning
    split the collection into several sub-collections and index each one of them separately (corresponding to vertically slicing the T × D matrix)
    pros:
        higher throughput
        new documents are easily added to existing indexes
        load balanced
    cons:
        high number of disk operations
        high volume of data read from disk

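Document partitioning can be sketched as below. The round-robin assignment and the function names are assumptions chosen to keep the example deterministic; they are not taken from the slides.

```python
from collections import defaultdict


def index_partition(docs):
    """Build a term -> set(doc_id) index for one sub-collection."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def document_partition(docs, n):
    """Assign documents round-robin to n sub-collections and index each one."""
    parts = [{} for _ in range(n)]
    for i, doc_id in enumerate(sorted(docs)):
        parts[i % n][doc_id] = docs[doc_id]
    return [index_partition(p) for p in parts]


def search(indexes, term):
    """Broadcast the term to every partition and merge the answers."""
    hits = set()
    for index in indexes:
        hits |= index.get(term, set())
    return sorted(hits)
```

Every query touches every partition, which is where the high number of disk operations comes from; adding a document, on the other hand, only affects one local index.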
Term Partitioning
    split terms of the lexicon (and the corresponding inverted lists) among search systems (corresponding to horizontally slicing the T × D matrix)
    pros:
        reduced number of disk accesses
        reduced volume of exchanged data
    cons:
        requires the entire index to be built before slicing it into partitions
        not scalable with large collections

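Term partitioning can be illustrated in the same spirit. In this sketch a complete index is split by a hash of the term; the use of CRC32 and the function names are assumptions, and real systems may assign terms quite differently.

```python
import zlib


def owner(term, n):
    """Server responsible for a term (stable hash, illustrative choice)."""
    return zlib.crc32(term.encode()) % n


def term_partition(global_index, n):
    """Split a complete term -> postings index across n servers."""
    servers = [dict() for _ in range(n)]
    for term, postings in global_index.items():
        servers[owner(term, n)][term] = postings
    return servers


def servers_for_query(terms, n):
    """Only the servers owning at least one query term are contacted."""
    return sorted({owner(t, n) for t in terms})
```

Note that `term_partition` takes the whole index as input, and that a query contacts at most as many servers as it has terms.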
Partitioning Goals
    partitioning is the first design issue to be faced in distributed indexing
    a distributed index should allow for efficient query routing and resolution
    reducing the number of nodes queried is desirable too

Partitioning Techniques
    random partitioning
        documents are assigned uniformly at random (u.a.r.) to the partitions
    topical organization using clustering (e.g. k-means [Larkey et al., 2000, Liu and Croft, 2004])
        documents are first clustered, and then each partition is composed of one or more clusters
    usage-induced partitioning (e.g. Query-Vector Document Model [Puppin et al., 2006])
        clustering is induced by the way users interact with the index

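The simplest of these techniques, random partitioning, fits in a few lines. The function name and the optional `seed` parameter are illustrative assumptions; the clustering-based and usage-induced techniques need considerably more machinery and are not sketched here.

```python
import random


def random_partition(doc_ids, n, seed=None):
    """Assign each document to one of n partitions uniformly at random."""
    rng = random.Random(seed)
    parts = [[] for _ in range(n)]
    for doc_id in doc_ids:
        parts[rng.randrange(n)].append(doc_id)
    return parts
```

Each document lands in exactly one partition, and partition sizes are only balanced in expectation.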
Load Balancing Issues
    In document-partitioned indexes that do not adopt collection selection strategies, load is almost balanced among all the query processors
    In term-partitioned indexes (even with the new pipelined scheme [Webber et al., 2006]) load balancing is an issue
    In federated document-partitioned systems where collection selection is applied, balancing the load is still an unexplored issue
    [Figure: load percentage (0 to 100) per query processor (1 to 8), in two panels: Document-distributed and Pipelined.]

Exploiting Usage Information
    Query logs contain features that are critical for optimizing efficiency of different parts of search engines
        query distribution
        query arrival time
        clickthrough information
        ...

Usage Information in Term Partitioned Systems
    frequency of query terms can be exploited to partition a collection with the aim of balancing the load of query processors
    bin-packing approach [Moffat et al., 2006]
    data mining approach [Lucchese et al., 2007]

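To give a flavour of the bin-packing idea, the sketch below assigns terms to servers greedily, heaviest term first, always onto the currently least-loaded server. This is a generic greedy heuristic and not the algorithm of [Moffat et al., 2006]; the per-term load values (for example, query frequency estimated from a log) and the function name are assumptions.

```python
import heapq


def balance_terms(term_load, n_servers):
    """Greedy packing: heaviest term first, onto the least-loaded server."""
    heap = [(0, s) for s in range(n_servers)]   # (current load, server id)
    heapq.heapify(heap)
    assignment = {}
    # Sort by decreasing load; break ties by term for determinism.
    for term, load in sorted(term_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, server = heapq.heappop(heap)
        assignment[term] = server
        heapq.heappush(heap, (total + load, server))
    loads = [0] * n_servers
    for term, server in assignment.items():
        loads[server] += term_load[term]
    return assignment, loads
```

For instance, with hypothetical loads `{"the": 50, "web": 30, "search": 20, "ir": 10, "crawl": 10}` and two servers, both servers end up with a load of 60.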
Usage Information in Document Partitioned Systems
    random partitioning does not ensure load balancing [Badue et al., 2006]
    broadcast-based systems perform unnecessary operations on sub-collections containing few or no relevant documents
    usage-based mapping can be adopted to partition the collection into sub-collections that can be effectively discriminated upon query receipt [Puppin et al., 2006]

Challenges in Distributed Indexing
    in document-partitioned systems, partitioning strategies are needed that improve the effectiveness of collection selection
    in both systems it is a challenge to find effective load balancing strategies

Query processing
    System components
        Clients submitting queries
        Sites consisting of servers
        Servers are commodity computers
    Query processing
        System receives a query
        Query routing: forwarding query to appropriate sites
        Merging results
    Challenges
        Determine appropriate sites on the fly
        WAN communication is costly

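The result-merging step can be sketched as a k-way merge of ranked lists. The sketch assumes that every site returns `(score, doc_id)` pairs sorted by decreasing score and that scores are comparable across sites, which in practice requires shared global statistics; the function name is illustrative.

```python
import heapq


def merge_results(per_site_results, k):
    """Merge per-site ranked lists of (score, doc_id) into a global top-k."""
    # heapq.merge expects each input sorted by the key, hence the negated score.
    merged = heapq.merge(*per_site_results, key=lambda r: -r[0])
    return [result for _, result in zip(range(k), merged)]
```

Because the merge is lazy, only as many entries are consumed from each site's list as are needed to fill the top k.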
Challenges in more detail
    Large-scale systems
        Large amount of data
        Large data structures
        Large number of clients and servers
    Partitioning of data structures
        Necessary due to very large data structures
        Parallel processing
        e.g. document collection split by topic, language, region
    Replication of data structures
        For availability, throughput, and response time
        Conflict with resource utilization

Framework for Distributed Query Processing
    [Figure: a client connected over a WAN to Site A (Region X), Site B (Region Y) and Site C (Region Z), with labels 1, 2 and 3.]
    Query processor matches documents to the received queries
    Coordinator receives queries and routes them to appropriate sites
    Cache stores results from previous queries

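How a coordinator might combine the cache and the query processor is sketched below. The slides do not specify an eviction policy; least-recently-used (LRU) is assumed here, and the names `ResultCache` and `answer` are hypothetical.

```python
from collections import OrderedDict


class ResultCache:
    """Small LRU cache of query -> results (eviction policy is an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, query):
        if query not in self._entries:
            return None
        self._entries.move_to_end(query)        # mark as most recently used
        return self._entries[query]

    def put(self, query, results):
        self._entries[query] = results
        self._entries.move_to_end(query)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict least recently used


def answer(query, cache, process):
    """Serve from the cache when possible; otherwise call the query processor."""
    results = cache.get(query)
    if results is None:
        results = process(query)
        cache.put(query, results)
    return results
```

A repeated query is answered without calling `process` again, which is also what lets a cache keep serving some queries while the query processors are unreachable.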
Currently...
    Multiple sites
        Sites are full replicas of each other
        Simple query routing: Dynamic DNS
    According to the previous framework, there is an opportunity to
        Use storage resources more efficiently
        Use more sophisticated query routing mechanisms
        Apply effective partition strategies (e.g., language-based strategies)

Partitioning
    Goals
        Achieve cost-effective scalability
        Reduce response times
    Potential solutions
        Partition of large data structures by topic, language, etc.
        Effective query routing first to local sites, then to global sites
        Incremental presentation of results to alleviate network latencies

Dependability
    Goals
        Availability of query processors
        Consistency of replicated query data (can be weak)
        Consistency of user state: e.g., personalization, user preferences
    Potential solutions
        More network resources: multi-homed sites
        Replication: within and across sites
        Consistency: techniques for weak consistency (replicas eventually converge)
        Caching: improve availability when query processors are unavailable

                  Dependability

                  Achieving availability is not straightforward
                       BIRN system studied by Junqueira and Marzullo [Junqueira and Marzullo, 2005]
                       Partitions are quite frequent

                  [Figure: average number of sites (y-axis, 0 to 12) per monthly availability class (x-axis: < 100, < 99.8, < 99, < 98, < 97)]

                  Communication

                  Message latency
                       Communication is costly in wide-area networks
                       Latency is not negligible
                       Reduced capacity of servers as the latency to process a query increases
                  Potential solutions
                       Reduce as much as possible the number of sites contacted to process a query
                       Most queries processed by sites that are close according to network distance

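One way to see why higher latency reduces server capacity is Little's law. The sketch below assumes that a server can keep only a fixed number of queries in flight at once (for example, a bounded thread pool); that assumption, the function name `max_throughput_qps`, and the numbers in the usage example are illustrative and not taken from the slides.

```python
def max_throughput_qps(slots: int, latency_ms: float) -> float:
    """Upper bound on sustained queries per second for one server.

    Little's law: N = X * R, so X = N / R, where N is the number of queries
    in flight (slots), R the time each query holds a slot, and X the throughput.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return slots * 1000 / latency_ms


if __name__ == "__main__":
    # Same server, same number of slots; only the per-query latency changes,
    # e.g. because a query now waits on a remote site over the WAN.
    for latency_ms in (50, 250):
        print(latency_ms, "ms ->", max_throughput_qps(100, latency_ms), "queries/s")
```

Under this model, a query that waits on a wide-area round trip holds its slot for the whole wait, which is why contacting fewer and closer sites helps capacity as well as response time.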
                  Caching query results or postings [Baeza-Yates et al., 2007]

                  Caching query answers:
                       44% of queries are singletons (appear only once)
                       88% of the unique queries are singletons
                       Infinite cache would achieve 56% hit-ratio

                  Caching postings of terms:
                       4% of terms are singletons
                       73% of the unique terms (the vocabulary) are singletons
                       Infinite cache would achieve 96% hit-ratio

                  Note: All statistics and graphs on caching refer to a one-year query
                  log from yahoo.co.uk

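The singleton statistics above can be computed from any query log along the following lines. This is a minimal sketch: the function name `singleton_stats` and the toy log are illustrative and do not come from the original study. The hit ratios quoted on the slide equal 100% minus the singleton share of the volume (100 - 44 = 56 and 100 - 4 = 96); the sketch reports that value and also a stricter one in which the first occurrence of every distinct item counts as a miss.

```python
from collections import Counter
from typing import Dict, Iterable


def singleton_stats(log: Iterable[str]) -> Dict[str, float]:
    """Singleton shares and infinite-cache hit-ratio bounds for a stream of items.

    The items can be whole queries (answer cache) or single terms (postings cache).
    """
    counts = Counter(log)
    total = sum(counts.values())   # volume: every occurrence counts
    unique = len(counts)           # distinct items
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        # share of the volume made up of items seen exactly once
        "singleton_share_of_volume": singletons / total,
        # share of the distinct items seen exactly once
        "singleton_share_of_unique": singletons / unique,
        # only singletons are treated as unavoidable misses
        "hit_ratio_excluding_singletons": 1 - singletons / total,
        # the first occurrence of every distinct item is a miss as well
        "hit_ratio_first_occurrence_misses": 1 - unique / total,
    }


if __name__ == "__main__":
    toy_log = ["x"] * 4 + ["y"] * 2 + ["p", "q", "r", "s"]
    for name, value in singleton_stats(toy_log).items():
        print(f"{name}: {value:.2f}")
```

Running the same function over the stream of query terms instead of whole queries gives the corresponding figures for a postings cache.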
                  Static or dynamic caching of postings

                  Static caching of postings (Qtf)
                       Cache terms with the highest query log frequency f_q(t)
                  However, there is a tradeoff between f_q(t) and f_d(t)
                       Terms with high query log frequency f_q(t) are good for the cache
                       Terms with high document frequency f_d(t) occupy too much space
                  Static caching of postings as a Knapsack problem (QtfDf)
                       Cache posting lists of terms with the highest ratio f_q(t) / f_d(t)

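The QtfDf selection rule can be sketched as a greedy knapsack heuristic: sort terms by f_q(t) / f_d(t) and fill the cache until the space runs out. The sketch assumes that a posting list costs one unit of space per document containing the term, so its size equals f_d(t); that cost model, the function name `select_static_cache`, and the example frequencies are assumptions for illustration. Greedy selection by ratio is a standard knapsack approximation and is not guaranteed to be optimal.

```python
from typing import Dict, List, Tuple


def select_static_cache(
    terms: Dict[str, Tuple[int, int]], capacity: int
) -> Tuple[List[str], int]:
    """Pick posting lists for a static cache by descending f_q(t) / f_d(t).

    terms maps each term to (f_q, f_d): query-log frequency and document
    frequency, with f_d > 0. capacity is measured in postings.
    Returns the cached terms and the space actually used.
    """
    ranked = sorted(terms, key=lambda t: terms[t][0] / terms[t][1], reverse=True)
    cached: List[str] = []
    used = 0
    for term in ranked:
        size = terms[term][1]  # assumed cost: one unit per posting
        if used + size <= capacity:
            cached.append(term)
            used += size
    return cached, used


if __name__ == "__main__":
    # Hypothetical frequencies: "the" is queried often but its list is huge.
    stats = {
        "the": (1000, 900),
        "yahoo": (300, 50),
        "barcelona": (120, 40),
        "zyzzyva": (1, 1),
    }
    print(select_static_cache(stats, capacity=100))
```

With a plain Qtf ranking, "the" would come first because of its high query frequency, but its posting list alone exceeds the cache; ranking by the ratio favours terms that are queried often relative to the space they take.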
                  Analysis of static caching

                  Trade-offs between caching postings and answers
                       Caching postings results in more hits
                       Caching answers is faster
                       To compare them, we need to consider time/space parameters

                  Problem: Given a fixed amount of memory and the average
                  response times for a system, how much to allocate for caching
                  answers and how much for caching postings?

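The allocation problem can be explored with a small search over the split. The sketch below is a simplified model and not the analysis of the cited paper: it assumes a query is first looked up in the answer cache, then evaluated from cached postings if possible, and otherwise evaluated from disk. The function `best_split`, the hit-ratio curves, and the timing values are hypothetical inputs that would have to be measured on a real system.

```python
from typing import Callable, Tuple


def best_split(
    memory: float,
    hit_answers: Callable[[float], float],
    hit_postings: Callable[[float], float],
    t_answer: float,
    t_postings: float,
    t_disk: float,
    steps: int = 10,
) -> Tuple[float, float]:
    """Try answer-cache fractions 0, 1/steps, ..., 1 and keep the fastest.

    hit_answers(m) and hit_postings(m) give the hit ratio obtained with m
    units of memory. Returns (fraction given to answers, average time).
    """
    best = None
    for i in range(steps + 1):
        x = i / steps
        ha = hit_answers(x * memory)
        hp = hit_postings((1 - x) * memory)
        # Answer hit, else postings hit, else full evaluation from disk.
        avg = ha * t_answer + (1 - ha) * (hp * t_postings + (1 - hp) * t_disk)
        if best is None or avg < best[1]:
            best = (x, avg)
    return best


if __name__ == "__main__":
    # Made-up saturating hit-ratio curves, for illustration only.
    answers_curve = lambda m: min(0.5, m / 200)
    postings_curve = lambda m: min(0.9, m / 100)
    print(best_split(100, answers_curve, postings_curve, 1, 10, 100))
```

Adding a network round-trip cost to `t_postings` and `t_disk` but not to `t_answer` is one way to model a wide-area deployment, where a hit in the answer cache avoids contacting remote sites.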
                  Analysis of static caching

                  Scenario 1: Centralized retrieval system, complete/partial query
                  evaluation, compressed/uncompressed postings
                       Postings cache can answer more queries than answers cache
                       Most available memory for caching postings

                  Scenario 2: WAN distributed system, complete/partial query
                  evaluation, compressed/uncompressed postings
                       Network time dominates
                       Most available memory for caching answers

                  Query Dynamics
                       Slowly changing query dynamics make static caching viable

                  Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and
                  Ziviani, N. (2006).
                  Analyzing imbalance among homogeneous index servers in a
                  web search system.
                  Information Processing & Management.

                  Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V.,
                  Silvestri, F., and Plachouras, V. (2007).
                  The impact of caching on search engines.
                  In Proceedings of the International ACM SIGIR Conference (to
                  appear), Amsterdam, Netherlands.

                  Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004).
                  UbiCrawler: a scalable fully distributed web crawler.
                  Software, Practice and Experience, 34(8):711–726.

                  Junqueira, F. and Marzullo, K. (2005).
                  Coterie availability in sites.
                  In Proceedings of the International Conference on Distributed
                  Computing (DISC), number 3724 in LNCS, pages 3–17,
                  Krakow, Poland. Springer Verlag.

                  Larkey, L. S., Connell, M. E., and Callan, J. (2000).
                  Collection selection and results merging with topically
                  organized U.S. patents and TREC data.
                  In CIKM ’00: Proceedings of the ninth international conference
                  on Information and knowledge management, pages 282–289,
                  New York, NY, USA. ACM Press.

                  Liu, X. and Croft, W. B. (2004).
                  Cluster-based retrieval using language models.
                  In SIGIR ’04: Proceedings of the 27th annual international
                  ACM SIGIR conference on Research and development in
                  information retrieval, pages 186–193, New York, NY, USA.
                  ACM Press.

                  Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. (2007).
                  Mining query logs to optimize index partitioning in parallel web
                  search engines.
                  To appear in Proceedings of the 2nd International Conference
                  on Scalable Information Systems (INFOSCALE 2007).

                  Moffat, A., Webber, W., and Zobel, J. (2006).
                  Load balancing for term-distributed parallel retrieval.
                  In SIGIR ’06: Proceedings of the 29th annual international
                  ACM SIGIR conference on Research and development in
                  information retrieval, pages 348–355, New York, NY, USA.
                  ACM Press.

                  Puppin, D., Silvestri, F., and Laforenza, D. (2006).
                  Query-driven document partitioning and collection selection.
                  In InfoScale ’06: Proceedings of the 1st international
                  conference on Scalable information systems, page 34, New
                  York, NY, USA. ACM Press.

                  Webber, W., Moffat, A., Zobel, J., and Baeza-Yates, R.
                  (2006).
                  A pipelined architecture for distributed text query evaluation.
                  Information Retrieval.
                  Published online October 5, 2006.

Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

  • 1. Challenges in Distributed Information Retrieval. Ricardo Baeza-Yates (1, 2). Joint work with: C. Castillo (1), F. Junqueira (1), V. Plachouras (1) and F. Silvestri (3). Affiliations: (1) Yahoo! Research Barcelona, Catalunya, Spain; (2) Yahoo! Research Latin America, Santiago, Chile; (3) ISTI-CNR, Pisa, Italy. Running header on every slide: Challenges in Distributed IR; Ricardo Baeza-Yates; Crawling, Indexing, Query Processing, Caching.
  • 2. Outline: 1. Crawling; 2. Indexing; 3. Query Processing; 4. Caching.
  • 3. Main Modules and Issues. For each module: partition; dependability; communication (sync.); external factors.
      • Crawling. Partition: URL assignment. Dependability: re-crawl. Communication: URL exchanges. External factors: Web growth, content change, network topology, bandwidth, DNS, QoS of servers.
      • Indexing. Partition: document partition, term partition. Dependability: re-index. Communication: partial indexing, updating, merging. External factors: Web growth, content change, global statistics.
      • Querying. Partition: query routing, collection selection, load balancing. Dependability: replication, caching. Communication: rank aggregation, personalization. External factors: changing user needs, user base growth, DNS.
  • 6. Crawling. In theory it is simple: fetch, parse, fetch, parse, ... In practice it is difficult: it implies using other people’s resources (web servers’ CPU and network).
  • 10. Issues. How to partition the crawling task? What to do when one agent fails? How to communicate among agents? How to deal with external factors?
  • 12. Partitioning. Host-based partitioning exploits the locality of links. Balance improves if large and small hosts are treated differently. Performance improves if geographic location is considered. Consistent hashing allows agents to be added to and removed from the pool [Boldi et al., 2004].
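To make the combination of host-based partitioning and consistent hashing concrete, here is a minimal sketch. It is not the UbiCrawler implementation from [Boldi et al., 2004]; the class name `CrawlerRing`, the agent names and the number of virtual nodes per agent are assumptions made for illustration.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _point(key: str) -> int:
    """Map a string to a position on the hash ring (stable across runs)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class CrawlerRing:
    """Hypothetical consistent-hash ring that assigns hosts to crawling agents."""

    def __init__(self, replicas: int = 64):
        self.replicas = replicas  # virtual nodes per agent (assumed value)
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> agent name

    def add_agent(self, agent: str) -> None:
        for i in range(self.replicas):
            p = _point(f"{agent}#{i}")
            self._owner[p] = agent
            bisect.insort(self._points, p)

    def remove_agent(self, agent: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != agent]
        self._owner = {p: a for p, a in self._owner.items() if a != agent}

    def agent_for(self, url: str) -> str:
        # Host-based partitioning: hash the host only, so every URL of a
        # host (and most of its intra-host links) stays with one agent.
        host = urlsplit(url).hostname or url
        i = bisect.bisect_right(self._points, _point(host)) % len(self._points)
        return self._owner[self._points[i]]
```

When an agent leaves the pool, only the hosts it owned are reassigned; every other host keeps its agent, which is the property that makes adding and removing agents cheap.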
  • 13. Communication. Host-based partitioning reduces communication. Highly linked URLs should be cached. Communication with the server can be improved if the server cooperates.
  • 14. External factors. DNS can be a bottleneck. Varying quality of HTTP implementations. Varying quality of HTML coding. Varying quality of service in general. Spam.
  • 19. What’s Indexing. In databases and IR, indexing is the process of building an index over a collection of documents. Inverted indexes are the index structure typically used in IR. Lexicon: contains the distinct terms appearing in the collection’s documents. Posting lists: contain descriptions of the occurrences of each term within the documents in which it appears.
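The lexicon and posting lists described above can be sketched in a few lines. This is a toy illustration only: the function name `build_inverted_index`, the whitespace tokenizer and the choice of storing word positions in each posting are assumptions, not details from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build a toy inverted index.

    docs maps doc_id -> text. The result maps each term to its posting
    list: a list of (doc_id, [positions]) pairs sorted by doc_id.
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization (assumption): lowercase and split on whitespace.
        for pos, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)
    return {term: sorted(plist.items()) for term, plist in postings.items()}


docs = {1: "web search engine", 2: "web crawler"}
index = build_inverted_index(docs)
lexicon = sorted(index)  # the distinct terms of the collection
```

Here `lexicon` plays the role of the lexicon, and `index[term]` is the posting list of `term`.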
  • 20. Index and Distributed Indexing. [Figure: the term × document (T × D) matrix. Term partitioning slices it into term groups T1, T2, ..., Tn; document partitioning slices it into document groups D1, D2, ..., Dm.]
  • 28. Document Partitioning. Split the collection into several sub-collections and index each of them separately (this corresponds to vertically slicing the T × D matrix). Pros: higher throughput; new documents are easily added to existing indexes; load is balanced. Cons: high number of disk operations; high volume of data read from disk.
  • 35. Term Partitioning. Split the terms of the lexicon (and the corresponding inverted lists) among search systems (this corresponds to horizontally slicing the T × D matrix). Pros: reduced number of disk accesses; reduced volume of exchanged data. Cons: requires the entire index to be built before slicing it into partitions; not scalable with large collections.
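The two ways of slicing the T × D matrix can be compared on a toy index. The sketch below assumes an index that maps each term to a list of integer document ids; the function names and the assignment rules (document id modulo k, CRC32 of the term modulo k) are illustrative assumptions.

```python
import zlib


def document_partition(index, k):
    """Each partition indexes a subset of the documents (vertical slice)."""
    parts = [dict() for _ in range(k)]
    for term, doc_ids in index.items():
        for d in doc_ids:
            # Assumed rule: document d goes to partition d mod k.
            parts[d % k].setdefault(term, []).append(d)
    return parts


def term_partition(index, k):
    """Each partition holds the complete posting lists of a subset of terms
    (horizontal slice)."""
    parts = [dict() for _ in range(k)]
    for term, doc_ids in index.items():
        # Assumed rule: stable hash of the term modulo k.
        parts[zlib.crc32(term.encode("utf-8")) % k][term] = list(doc_ids)
    return parts


index = {"web": [0, 1, 2, 3], "crawler": [1, 3], "cache": [2]}
doc_parts = document_partition(index, 2)
term_parts = term_partition(index, 2)
```

Under document partitioning the posting list of "web" is spread over both partitions, so a query on it has to visit both. Under term partitioning the whole list lives on one partition, but it can only be placed there once the full list has been built.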
  • 38. Partitioning Goals. Partitioning is the first design issue to be faced in distributed indexing. A distributed index should allow for efficient query routing and resolution. Reducing the number of nodes queried is desirable too.
  • 44. Partitioning Techniques. Random partitioning: documents are assigned uniformly at random (u.a.r.) to the various partitions. Topical organization using clustering (e.g. k-means [Larkey et al., 2000, Liu and Croft, 2004]): documents are first clustered, and then each partition is composed of one or more clusters. Usage-induced partitioning (e.g. Query-Vector Document Model [Puppin et al., 2006]): clustering is induced by the way users interact with the index.
  • 45. Load Balancing Issues. In document-partitioned indexes that do not adopt collection selection strategies, load is almost balanced among all the query processors. In term-partitioned indexes (even with the new pipelined scheme [Webber et al., 2006]), load balancing is an issue. In federated document-partitioned systems where collection selection is applied, balancing the load is still an unexplored issue. [Figure: two panels, Document-distributed and Pipelined; y-axis: load percentage (0 to 100); x-axis: 1 to 8.]
  • 50. Exploiting Usage Information. Query logs contain features that are critical for optimizing the efficiency of different parts of search engines: query distribution, query arrival time, clickthrough information, ...
  • 53. Usage Information in Term Partitioned Systems. The frequency of query terms can be exploited to partition a collection with the aim of balancing the load of the query processors: bin-packing approach [Moffat et al., 2006]; data mining approach [Lucchese et al., 2007].
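As a rough illustration of the bin-packing idea, the sketch below uses a generic greedy heuristic (largest load first, always onto the least-loaded server). It is not claimed to be the algorithm of [Moffat et al., 2006]; the function name `balance_terms` and the per-term load estimates are assumptions, and in practice the estimates might be derived from query-log term frequencies.

```python
import heapq


def balance_terms(term_load, k):
    """Assign terms to k servers, heaviest first, each to the currently
    least-loaded server. Returns (assignment, per-server loads)."""
    heap = [(0.0, server) for server in range(k)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by decreasing load; break ties by term for determinism.
    for term, load in sorted(term_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, server = heapq.heappop(heap)
        assignment[term] = server
        heapq.heappush(heap, (total + load, server))
    loads = [0.0] * k
    for term, server in assignment.items():
        loads[server] += term_load[term]
    return assignment, loads


# Hypothetical per-term loads estimated from a query log.
term_load = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
assignment, loads = balance_terms(term_load, 2)
```

The heuristic does not guarantee an optimal split, but it keeps the heaviest terms apart, which is where most of the imbalance in a term-partitioned index tends to come from.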
  • 56. Usage Information in Document Partitioned Systems. Random partitioning does not ensure load balancing [Badue et al., 2006]. Broadcast-based systems perform unnecessary operations on sub-collections containing few or no relevant documents. Usage-based mapping can be adopted to partition the collection into sub-collections that can be effectively discriminated upon query receipt [Puppin et al., 2006].
  • 58. Challenges in Distributed Indexing. In document-partitioned systems, partitioning strategies are needed that enhance the effectiveness of collection selection. In both kinds of system, it is a challenge to find effective load balancing strategies.
  • 59. Query processing.
      • System components: clients submitting queries; sites consisting of servers; servers are commodity computers.
      • Query processing: the system receives a query; query routing forwards the query to the appropriate sites; results are merged.
      • Challenges: determining the appropriate sites on the fly; WAN communication is costly.
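The result-merging step can be sketched as a k-way merge of per-site ranked lists. This is a minimal sketch under two assumptions that the slides do not state: each site returns `(score, doc_id)` pairs already sorted by decreasing score, and scores are comparable across sites (which in practice depends on shared global statistics). The function name `merge_results` is hypothetical.

```python
import heapq
from itertools import islice


def merge_results(site_results, k):
    """Merge per-site ranked lists into a single global top-k list.

    site_results is a list of lists of (score, doc_id) pairs, each sorted
    by decreasing score.
    """
    merged = heapq.merge(*site_results, key=lambda hit: -hit[0])
    # The merge is lazy, so only the first k hits are actually consumed.
    return list(islice(merged, k))


site_a = [(0.9, "a1"), (0.4, "a2")]
site_b = [(0.8, "b1"), (0.7, "b2")]
top3 = merge_results([site_a, site_b], 3)
```

Because the merge is lazy, a coordinator that needs only the first page of results does not have to consume each site's full list.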
  • 60. Challenges in more detail.
      • Large-scale systems: large amount of data; large data structures; large number of clients and servers.
      • Partitioning of data structures: necessary due to very large data structures; parallel processing; e.g. document collection split by topic, language, region.
      • Replication of data structures: for availability, throughput, and response time; conflicts with resource utilization.
  • 61. Framework for Distributed Query Processing. [Figure: a client connected over a WAN to Site A (Region X), Site B (Region Y) and Site C (Region Z), with steps numbered 1 to 3.] The query processor matches documents to the received queries. The coordinator receives queries and routes them to the appropriate sites. The cache stores results from previous queries.
  • 62. Currently...
      • Multiple sites; sites are full replicas of each other; simple query routing via dynamic DNS.
      • According to the previous framework, there is an opportunity for: more efficient use of storage resources; more sophisticated query routing mechanisms; effective partitioning strategies (e.g., language-based strategies).
  • 63. Partitioning.
      • Goals: achieve cost-effective scalability; reduce response times.
      • Potential solutions: partitioning of large data structures by topic, language, etc.; effective query routing, first to local sites and then to global sites; incremental presentation of results to alleviate network latencies.
  • 64. Dependability.
      • Goals: availability of query processors; consistency of replicated query data (can be weak); consistency of user state, e.g., personalization, user preferences.
      • Potential solutions: more network resources (multi-homed sites); replication within and across sites; techniques for weak consistency (replicas eventually converge); caching to improve availability when query processors are unavailable.
  • 65. Dependability. Achieving availability is not straightforward. BIRN system studied by Junqueira and Marzullo [Junqueira and Marzullo, 2005]: partitions are quite frequent. [Figure: average number of sites (0 to 12) by monthly availability: < 100, < 99.8, < 99, < 98, < 97.]
  • 66. Communication.
      • Message latency: communication is costly in wide-area networks; latency is not negligible; server capacity is reduced as the latency to process a query increases.
      • Potential solutions: reduce as much as possible the number of sites contacted to process a query; have most queries processed by sites that are close in terms of network distance.
  • 67. Caching query results or postings [Baeza-Yates et al., 2007].
      • Caching query answers: 44% of queries are singletons (appear only once); 88% of the unique queries are singletons; an infinite cache would achieve a 56% hit ratio.
      • Caching postings of terms: 4% of terms are singletons; 73% of the unique terms (the vocabulary) are singletons; an infinite cache would achieve a 96% hit ratio.
      • Note: all statistics and graphs on caching refer to a one-year query log from yahoo.co.uk.
  • 68. Static or dynamic caching of postings.
      • Static caching of postings (Qtf): cache the terms with the highest query log frequency fq(t).
      • However, there is a trade-off between fq(t) and fd(t): terms with high query log frequency fq(t) are good for the cache; terms with high document frequency fd(t) occupy too much space.
      • Static caching of postings as a knapsack problem (QtfDf): cache the posting lists of the terms with the highest ratio fq(t) / fd(t).
  • 69. Static or dynamic caching of postings. [Figure not captured in the transcript.]
  • 70. Analysis of static caching.
      • Trade-offs between caching postings and answers: caching postings results in more hits; caching answers is faster; comparing them requires considering time/space parameters.
      • Problem: given a fixed amount of memory and the average response times for a system, how much should be allocated to caching answers and how much to caching postings?
  • 71. Analysis of static caching.
      • Scenario 1: centralized retrieval system, complete/partial query evaluation, un/compressed postings. The postings cache can answer more queries than the answers cache; most of the available memory goes to caching postings.
      • Scenario 2: WAN distributed system, complete/partial query evaluation, un/compressed postings. Network time dominates; most of the available memory goes to caching answers.
      • Query dynamics: slowly changing query dynamics make static caching viable.
  • 72. References.
      • Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and Ziviani, N. (2006). Analyzing imbalance among homogeneous index servers in a web search system. Information Processing & Management.
      • Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Silvestri, F., and Plachouras, V. (2007). The impact of caching on search engines. In Proceedings of the International ACM SIGIR Conference (to appear), Amsterdam, Netherlands.
      • Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004). Ubicrawler: a scalable fully distributed web crawler. Software, Practice and Experience, 34(8):711–726.
  • 73. References (continued).
      • Junqueira, F. and Marzullo, K. (2005). Coterie availability in sites. In Proceedings of the International Conference on Distributed Computing (DISC), number 3724 in LNCS, pages 3–17, Krakow, Poland. Springer Verlag.
      • Larkey, L. S., Connell, M. E., and Callan, J. (2000). Collection selection and results merging with topically organized U.S. patents and TREC data. In CIKM ’00: Proceedings of the ninth international conference on Information and knowledge management, pages 282–289, New York, NY, USA. ACM Press.
      • Liu, X. and Croft, W. B. (2004). Cluster-based retrieval using language models. In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193, New York, NY, USA. ACM Press.
  • 74. References (continued).
      • Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. (2007). Mining query logs to optimize index partitioning in parallel web search engines. To appear in Proceedings of the 2nd International Conference on Scalable Information Systems (INFOSCALE 2007).
      • Moffat, A., Webber, W., and Zobel, J. (2006). Load balancing for term-distributed parallel retrieval. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 348–355, New York, NY, USA. ACM Press.
      • Puppin, D., Silvestri, F., and Laforenza, D. (2006). Query-driven document partitioning and collection selection. In InfoScale ’06: Proceedings of the 1st international conference on Scalable information systems, page 34, New York, NY, USA. ACM Press.
  • 75. References (continued).
      • Webber, W., Moffat, A., Zobel, J., and Baeza-Yates, R. (2006). A pipelined architecture for distributed text query evaluation. Information Retrieval. Published online October 5, 2006.