Caching Search Engine Results over Incremental Indices
Y! Research Barcelona: Roi Blanco, Flavio Junqueira, Luca Telloli*, Hugo Zaragoza
Y! Labs Haifa: Edward Bortnikov, Ronny Lempel
* currently at Barcelona Supercomputing Center
Overview
• Caching (Background & Prior Art)
• Cache Invalidation Predictors
• Experimental Setup
• Results
• Conclusions and Future Work
High-Level Architecture of Search Engines
[Figure: block diagram. Components shown: Runtime system, Cache, Query Engine, Parser/Tokenizer, Index, Indexing pipeline, WWW; edge labels: queries, results, terms.]
Web Search Results Caching 
Caching of Web Search Results is crucial: 
• Query stream is extremely REDUNDANT and BURSTY 
– Zipfian distribution of query popularity (redundant) 
– Extreme trending of topics (bursty) 
• CACHE: {q} → Search_Results(q)
• Caching benefits:
– Shorten the engine’s response time (user waiting) 
– Lower the number/cost of query executions (# data centers) 
• Caveat: data (pages) is constantly changing!
Caching Search Engine Results – Prior Art 
• Markatos, 2001: applied classical replacement policies (LRU, 
SLRU) to a 1M query log from Excite; demonstrated hit rates of 
~30% 
• Replacement policies tailored for search engines: 
– PDC: Probability Driven Cache (Lempel & Moran, 2003) 
– SDC: Static/Dynamic Cache (Silvestri, Fagni, Orlando, Palmerini and 
Perego, 2003) 
– AC: Admission-based Caching (Baeza-Yates, Junqueira, Plachouras 
and Witschel, 2007) 
• Other observations and approaches: 
– Lempel & Moran, 2004: theoretical study via competitive analysis 
– Gan & Suel, 2009: optimizing query evaluation work rather than hit 
rates 
– Cambazoglu, Junqueira, Plachouras, Banachowski, Cui, Lim, and 
Bridge, 2010: refreshing aged entries during times of low back-end 
load 
Traditional View:
• Dilemma: Freshness versus Computation
• Extreme #1: do not cache at all – evaluate all queries
– 100% fresh results, lots of redundant evaluations
• Extreme #2: never invalidate the cache
– A majority of stale results – results refreshed only due to cache replacement, no redundant work
• Middle ground: invalidate periodically (TTL)
– A time-to-live parameter is applied to each cached entry
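The TTL middle ground can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `TTLCache`, the `Entry` record, and the explicit `now` argument are assumptions made for the example. Setting `ttl=0` reproduces Extreme #1 (nothing is ever served from the cache) and `ttl=float("inf")` reproduces Extreme #2 (entries never expire).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    results: List[str]
    stored_at: float  # time at which the results were computed


class TTLCache:
    """Hypothetical results cache: entries expire `ttl` time units after insertion."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._entries: Dict[str, Entry] = {}

    def get(self, query: str, now: float) -> Optional[List[str]]:
        entry = self._entries.get(query)
        if entry is None:
            return None
        if now - entry.stored_at >= self.ttl:
            # Expired: drop the entry so the caller re-evaluates the query.
            del self._entries[query]
            return None
        return entry.results

    def put(self, query: str, results: List[str], now: float) -> None:
        self._entries[query] = Entry(results, now)
```

Passing the clock in as `now` keeps the sketch deterministic; a real cache would also bound its size with a replacement policy such as LRU.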
Caching in the Presence of Index Changes 
• Increasing importance of freshness in search: 
– News, Blogs, Twitter, Social, Reviews, Local… 
• Moving towards “Real Time Crawling”: 
– Latency measured in seconds instead of hours or days. 
• Caching, by definition, returns OLD results 
– Traditionally, as TTL → 0, cache hit rate → 0
• Can we have our cake and eat it too? 
– Can a cache operate on a very fast-changing collection?
Cache Invalidation Predictors 
• Main idea:
– Not ALL the documents change ALL the time.
– When a document changes (or is created/deleted), remove from the cache any queries that may have returned it.
– E.g., a new document on Spanish cooking arrives: no need to invalidate queries about quantum physics!
• Cache Invalidation Predictors (CIPs):
1. Capture document insertions/updates as they enter the index
2. Using document features, predict which cached search results will be affected by the updates
3. Invalidate those cached entries (equivalent to eviction)
4. Upon document deletion, invalidate any cached entry containing that document
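The four steps above can be sketched as a cache with two hooks, one for insertions/updates and one for deletions. This is only an illustration under stated assumptions: `InvalidatingCache`, `on_upsert`, `on_delete`, and the `all_terms_match` predictor are hypothetical names, the synopsis is reduced to a set of terms, and the linear scan over cached queries is for clarity rather than speed.

```python
from typing import Callable, Dict, List, Set

# Hypothetical predictor signature: (query, synopsis terms) -> "is q affected?"
Predictor = Callable[[str, Set[str]], bool]


class InvalidatingCache:
    def __init__(self, predictor: Predictor) -> None:
        self.predictor = predictor
        self.results: Dict[str, List[str]] = {}  # query -> cached document ids

    def on_upsert(self, synopsis_terms: Set[str]) -> List[str]:
        """Steps 1-3: a document was inserted or updated in the index."""
        hit = [q for q in self.results if self.predictor(q, synopsis_terms)]
        for q in hit:
            del self.results[q]  # invalidation is equivalent to eviction
        return hit

    def on_delete(self, doc_id: str) -> List[str]:
        """Step 4: a document was deleted from the index."""
        hit = [q for q, docs in self.results.items() if doc_id in docs]
        for q in hit:
            del self.results[q]
        return hit


def all_terms_match(query: str, synopsis_terms: Set[str]) -> bool:
    """Toy predictor: every query term occurs in the synopsis."""
    return set(query.lower().split()) <= synopsis_terms
```

With this toy predictor, a synopsis containing "spanish" and "cooking" evicts a cached "spanish cooking" entry and leaves "quantum physics" untouched.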
Cache Invalidation Predictor Architecture
[Figure: CIP architecture. Components shown: Runtime system, Cache, Query Engine, Parser/Tokenizer, Index, Indexing pipeline, Synopsis generator, CIP; edge labels: queries, terms; legend: data flow, API calls.]
The Invalidator: Brief Implementation Notes 
• The CIP needs to quickly locate, 
given a synopsis (e.g. document), 
which cached entries it matches 
• Essentially a reversed search engine: 
– The synopses are the queries 
– The queries (whose results are cached) are the documents 
• (!) Non-negligible cost of communication, indexing and 
querying. 
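The "reversed search engine" idea can be illustrated with a small inverted index whose indexed items are the cached queries and whose lookups are driven by synopsis terms. This is a sketch under assumptions: `QueryIndex` and its methods are hypothetical, tokenization is plain whitespace splitting, and matching is conjunctive (every query term must occur in the synopsis).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class QueryIndex:
    """Hypothetical inverted index whose 'documents' are the cached queries."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)  # term -> queries
        self._length: Dict[str, int] = {}  # query -> number of distinct terms

    def add(self, query: str) -> None:
        terms = set(query.lower().split())
        if not terms:
            return
        self._length[query] = len(terms)
        for term in terms:
            self._postings[term].add(query)

    def remove(self, query: str) -> None:
        if self._length.pop(query, None) is None:
            return
        for term in set(query.lower().split()):
            self._postings[term].discard(query)

    def match(self, synopsis_terms: Iterable[str]) -> List[str]:
        """Return cached queries all of whose terms occur in the synopsis."""
        seen: Dict[str, int] = defaultdict(int)
        for term in set(synopsis_terms):
            for query in self._postings.get(term, ()):
                seen[query] += 1
        # A query matches when every one of its distinct terms was seen.
        return sorted(q for q, n in seen.items() if n == self._length[q])
```

Lookup cost is driven by the postings of the synopsis terms rather than by the total number of cached queries, which is the reason for indexing the queries in the first place.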
Some Definitions 
• At any given time, a query in the cache may be: 
– Stale: cache entry no longer represents the results the engine 
would return for q 
– or not stale. 
• Stale Rate: proportion of queries for which the search 
engine returns stale results 
• (Both computable, given enough computing time!)
Some Definitions (2)
• At any given time a CIP may invalidate a query or not:
– True Positive: invalidation of a stale query
– True Negative: non-invalidation of a non-stale query
– False Positive: invalidation of a non-stale query
– False Negative: non-invalidation of a stale query
• False Negatives are much more expensive than False Positives:
– User dissatisfaction vs. computational time
– Frequency of query!
– Error spread forward in time!
• True Negatives lead to huge savings (× query volume)
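The four outcomes can be tallied from per-query observations as in the sketch below. The function name `confusion_rates` is hypothetical, and normalizing each count by the total number of cached queries is an assumption made for the example, not a definition taken from the slides.

```python
from typing import Dict, Iterable, Tuple


def confusion_rates(decisions: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Each item is (is_stale, was_invalidated) for one cached query."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    total = 0
    for is_stale, was_invalidated in decisions:
        total += 1
        if was_invalidated:
            counts["TP" if is_stale else "FP"] += 1
        else:
            counts["FN" if is_stale else "TN"] += 1
    if total == 0:
        raise ValueError("no decisions to evaluate")
    # Assumption: every rate is a fraction of all cached queries.
    return {name: n / total for name, n in counts.items()}
```

Under this normalization the four rates sum to 1, so a policy that invalidates everything has FN = 0 and FP equal to the fraction of non-stale queries.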
Invalidation Policies – Upon Match
• Upon match: invalidate query q whenever the synopsis of document d matches q
– E.g., for conjunctive queries, q ⊆ d
– E.g., for a BOW (bag-of-words) engine, stale rate = 0
• Example document: "The Boston Celtics beat the L.A. Lakers on their home court in the 4th game of the 2010 NBA Finals"
• Very low stale rate
• High FP! $$
[Figure: cached top-10 result lists (URL1 … URL10 with scores) for the queries "Boston Celtics", "Barack Obama", "World Cup", "L.A. Lakers", "Oil Spill", and "home".]
Invalidation Policies – Score Thresholding
• Score Thresholding: invalidate q whenever the projected score(q, d) is high enough (prerequisite: q matches d)
– Requires maintaining the min score per result set, and the stand-alone ability to compute the score of a synopsis with respect to any query
• Reduces FPs, increases FNs
• Example document: "President Barack Obama criticized BP yesterday for mishandling the oil spill in the Gulf of Mexico" (projected scores shown: Score=0.503, Score=0.681)
[Figure: cached top-10 result lists (URL1 … URL10 with scores) for the queries "Boston Celtics", "Barack Obama", "World Cup", "L.A. Lakers", "Oil Spill", and "home".]
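Score thresholding can be sketched as a filter applied on top of the match test. Everything named here is an assumption for illustration: `queries_to_invalidate` and `overlap_score` are hypothetical, the scorer is a toy stand-in for the engine's ranking function, and "high enough" is read as "strictly above the lowest score currently cached for q".

```python
from typing import Callable, Dict, List, Set

# Hypothetical stand-alone scorer: (query, synopsis terms) -> projected score
Scorer = Callable[[str, Set[str]], float]


def queries_to_invalidate(
    synopsis_terms: Set[str],
    min_cached_score: Dict[str, float],
    score: Scorer,
) -> List[str]:
    """Return cached queries whose result set the new document could enter."""
    hit = []
    for query, floor in min_cached_score.items():
        if not set(query.lower().split()) <= synopsis_terms:
            continue  # prerequisite: q must match d
        if score(query, synopsis_terms) > floor:
            hit.append(query)  # d would outrank the weakest cached result
    return hit


def overlap_score(query: str, synopsis_terms: Set[str]) -> float:
    """Toy scorer (not a real ranking function): share of synopsis terms in q."""
    query_terms = set(query.lower().split())
    return len(query_terms & synopsis_terms) / max(len(synopsis_terms), 1)
```

The only per-query state the check needs is the minimum cached score, which is why the slide lists it as a requirement. A matching document that scores below that floor is skipped, which is where the extra false negatives can come from when the projected score is inaccurate.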
CIP Policies – Synopsis Generation
• Full synopsis: entire document + all ranking attributes
• Idea: reduce synopsis by dropping stuff “unlikely” to affect scoring
– Less communication, but more prediction errors
• In this paper:
– transfer some fraction of top TF-IDF terms
– drop document revisions that didn’t “change much”
• Example: "We the people of the United States, in Order to form a more perfect Union, . . ." → Order, People, Perfect, union
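Keeping a fraction of the top TF-IDF terms can be sketched as below. The function name `synopsis`, the smoothed IDF formula, and the rounding rule (`ceil`) are assumptions made for the example; the document-frequency table `doc_freq` and collection size `n_docs` would come from the index.

```python
import math
from collections import Counter
from typing import Dict, List


def synopsis(
    doc_terms: List[str],
    doc_freq: Dict[str, int],
    n_docs: int,
    eta: float,
) -> List[str]:
    """Keep the top `eta` fraction of a document's distinct terms by TF-IDF."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must be in [0, 1]")
    tf = Counter(doc_terms)

    def weight(term: str) -> float:
        # Smoothed IDF (an assumption): rare terms get larger weights.
        return tf[term] * math.log((1 + n_docs) / (1 + doc_freq.get(term, 0)))

    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-weight(t), t))
    keep = math.ceil(eta * len(ranked))
    return ranked[:keep]
```

With `eta = 1` the synopsis contains every distinct term of the document, and smaller values trade communication volume for prediction accuracy, as the slide notes.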
Experimental Setting #1
• Sandbox experiment – static cache containing fixed query set, controlled document/query dynamics (no interleaving)
• Data Source: en.wikipedia.org
– History span: 2006 – 2008
– 2.8 TB, > 1M pages
– Dominated by updates (> 95%)
• Query Source: Y! query log
– 2 days of queries with a click on Wikipedia (2.5 M)
– Sample of 10K queries (9234 unique) chosen uniformly at random
• Evaluation pattern:
– 120 single-day epochs (~4% change/day)
– The same 10K query batch at the end of each epoch
• Search library: Apache Lucene open-source library
CIP Parameters – Notation Summary
Symbol   Meaning                               Range
η        Fraction of top terms in synopsis     0 … 1
δ        Revision modification threshold       0 … 1
1s       Score thresholding applied?           0/1
τ        Time-to-live (TTL) threshold          0 … ∞
Basic CIP: η = 1, δ = 0, τ = ∞, 1s = 0
Baseline Comparison
Policy                                                                False Positives   False Negatives   Stale Rate
No invalidation (TTL τ=∞)                                             0                 0.108             0.768 (!)
No caching (TTL τ=0)                                                  0.892             0                 0
TTL τ=2                                                               0.446             0.054             0.055
TTL τ=5                                                               0.179             0.086             0.175
Basic CIP (full synopses, invalidate upon match, threshold=no, τ=∞)   0.679             0.001             0.008 (!)
CIP Effectiveness: varying 1s, τ, and η
[Figure: plot with annotations "Shrinking synopsis" and "Growing TTL".]
CIP Effectiveness: varying 1s and δ
[Figure: plot with annotation "Increasing revision threshold".]
Best-in-Class Picture
[Figure]
Conclusions
• The problem of maintaining cached search results over incremental indexes is real, and under-explored
• We proposed the CIP framework for real-time search cache management
• We proposed an experimental setting for CIPs
• We demonstrated a simple CIP that significantly improves over prior art (TTL), and measured its sensitivity to various parameters
Future Work 
• Analyze a real-world scenario (News) 
– More drastic update and query dynamics 
– More realistic implementation to measure cost overhead 
– Compare to dynamic TTL 
• Continue Improving CIPs 
– Better synopsis 
– Connections between corpus dynamics and query dynamics 
• Study relation between real-time caching with CIPs and 
pre-fetching of results
Thank you! Questions? 
Policy Stability: Curbing Stale Results
[Figure: plots with annotations "Still growing", "stable", "Still growing but slowly", and "stable".]

Caching Search Engine Results over Incremental Indices

  • 1. Caching Search Engine Results over Incremental Indices Y! Research Barcelona Y! Labs Haifa Roi Blanco Edward Bortnikov Flavio Junqueira Ronny Lempel Luca Telloli * Hugo Zaragoza •* currently at Barcelona Supercomputing Center
  • 2. - 2 - Overview  Caching (Background & Prior Art)  Cache Invalidation Predictors  Experimental Setup  Results  Conclusions and Future Work
  • 3. CHIigPh A Lrcehvietle Actrucrheitecture of Search Engines Cache Query results Runtime system Parser/ Tokenizer - 3 - Index terms Engine queries Indexing pipeline W WWWWW
  • 4. Web Search Results Caching Caching of Web Search Results is crucial: • Query stream is extremely REDUNDANT and BURSTY – Zipfian distribution of query popularity (redundant) – Extreme trending of topics (bursty) • CACHE {q}  Search_Results(q) - 4 - • Caching benefits : – Shorten the engine’s response time (user waiting) – Lower the number/cost of query executions (# data centers) • Caveat: data (pages) is constantly changing!
  • 5. Caching Search Engine Results – Prior Art • Markatos, 2001: applied classical replacement policies (LRU, SLRU) to a 1M query log from Excite; demonstrated hit rates of ~30% • Replacement policies tailored for search engines: – PDC: Probability Driven Cache (Lempel & Moran, 2003) – SDC: Static/Dynamic Cache (Silvestri, Fagni, Orlando, Palmerini and Perego, 2003) – AC: Admission-based Caching (Baeza-Yates, Junqueira, Plachouras and Witschel, 2007) • Other observations and approaches: – Lempel & Moran, 2004: theoretical study via competitive analysis – Gan & Suel, 2009: optimizing query evaluation work rather than hit rates – Cambazoglu, Junqueira, Plachouras, Banachowski, Cui, Lim, and Bridge, 2010: refreshing aged entries during times of low back-end load - 5 -
  • 6. - 6 - Traditional View:  Dilemma: Freshness versus Computation  Extreme #1: do not cache at all – evaluate all queries  100% fresh results, lots of redundant evaluations  Extreme #2: never invalidate the cache  A majority of stale results – results refreshed only due to cache replacement, no redundant work  Middle ground: invalidate periodically (TTL)  A time-to-live parameter is applied to each cached entry
  • 7. Caching in the Presence of Index Changes • Increasing importance of freshness in search: – News, Blogs, Twitter, Social, Reviews, Local… • Moving towards “Real Time Crawling”: – Latency measured in seconds instead of hours or days. • Caching, by definition, returns OLD results – Traditionally as TTL  0, caching hit rate  0 • Can we have our cake and eat it too? – Can a cache operate on a very-fast changing collection? - 7 -
  • 8. Cache Invalidation Predictors - 8 -  Main idea:  Not ALL the documents change ALL the time.  When a document changes (or is created / deleted) remove from the cache any quires that may have returned it.  e.g: a new document on Spanish cooking arrives; no need to invalidate queries about quantum physics!  Cache Invalidator Predictors (CIP): 1. Capture document insertions/updates as they enter the index 2. Using document features, predict which cached search results will be affected by the updates 3. Invalidate those cached entries (equivalent to eviction) 4. Upon document deletion, invalidate any cached entry containing that document
  • 9. Cache Invalidation Predictor Architecture - 9 - CIP Architecture Legend Runtime system Parser/ Tokenizer Index terms Cache Query Engine CIP Synopsis generator queries Indexing pipeline Data flow API calls
  • 10. The Invalidator: Brief Implementation Notes • The CIP needs to quickly locate, given a synopsis (e.g. document), which cached entries it matches • Essentially a reversed search engine: – The synopses are the queries – The queries (whose results are cached) are the documents • (!) Non-negligible cost of communication, indexing and querying. - 10 -
  • 11. - 11 - Some Definitions • At any given time, a query in the cache may be: – Stale: cache entry no longer represents the results the engine would return for q – or not stale. • Stale Rate: proportion of queries for which the search engine returns stale results • (Both computable, given enough computing time!)
  • 12. - 12 - Some Definitions (2)  At any given time a CIP may invalidate a query or not:  True Positive: invalidation of stale query  -  True Negative: non-invalidation of non-stale query  !  False Positive: invalidation of non-stale query  $  False Negative: non-invalidation of stale-query  !  False Negatives are much more expensive than False Positives:  User dissatisfaction vs. computational time  Frequency of query!  Error spread forward in time!  True Negatives lead to huge savings (x query volume)
  • 13. Invalidation Policies – Upon Match
      • Upon match: invalidate query q whenever the synopsis of document d matches q
        – E.g., for conjunctive queries, q ⊆ d
        – E.g., for a bag-of-words (BOW) engine, stale rate = 0
      • Very low stale rate, but a high FP rate (costly)
      [Figure: the incoming document "The Boston Celtics beat the L.A. Lakers on their home court in the 4th game of the 2010 NBA Finals" is matched against the cached queries (Oil Spill, home, Boston Celtics, Barack Obama, World Cup, L.A. Lakers), each holding a scored top-10 result list such as URL1 0.875, URL2 0.834, …, URL9 0.692, URL10 0.511.]
  • 14. Invalidation Policies – Score Thresholding
      • Score thresholding: invalidate q whenever the projected score(q,d) is high enough (prerequisite: q matches d)
      • Requires maintaining the minimum score per result set, and the stand-alone ability to compute the score of a synopsis with respect to any query
      • Reduces FPs, increases FNs
      [Figure: the incoming document "President Barack Obama criticized BP yesterday for mishandling the oil spill in the Gulf of Mexico" is scored against the cached queries it matches (projected scores shown: 0.503 and 0.681). The cached queries (Oil Spill, home, Boston Celtics, Barack Obama, World Cup, L.A. Lakers) each hold a scored top-10 result list.]
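A rough sketch of score thresholding follows. It assumes that "high enough" means the projected score reaches the minimum score kept for the cached result set, which the slide suggests but does not state outright. The names `CachedEntry`, `invalidate_by_score` and `score_fn` are hypothetical, and the scoring function is passed in because the real ranking function is not given; the numbers are toy values.

```python
from dataclasses import dataclass


@dataclass
class CachedEntry:
    query_terms: frozenset  # terms of the cached (conjunctive) query
    scores: list            # scores of the cached results

    @property
    def min_score(self):
        # Lowest score in the cached result set; a new document must beat it.
        return min(self.scores)


def invalidate_by_score(cache, synopsis_terms, score_fn):
    """Remove and return cached queries that the synopsis matches AND outscores.

    score_fn(query_terms, synopsis) is a stand-in for the engine's scorer.
    """
    synopsis = set(synopsis_terms)
    to_drop = []
    for query, entry in cache.items():
        if not entry.query_terms <= synopsis:
            continue  # prerequisite: the query must match the synopsis
        if score_fn(entry.query_terms, synopsis) >= entry.min_score:
            to_drop.append(query)
    for query in to_drop:
        del cache[query]
    return to_drop


cache = {
    "oil spill": CachedEntry(frozenset({"oil", "spill"}), [0.9, 0.7, 0.6]),
    "barack obama": CachedEntry(frozenset({"barack", "obama"}), [0.8, 0.5, 0.3]),
    "world cup": CachedEntry(frozenset({"world", "cup"}), [0.9, 0.2, 0.1]),
}
toy_scores = {frozenset({"oil", "spill"}): 0.4, frozenset({"barack", "obama"}): 0.9}
synopsis = "president barack obama criticized bp oil spill gulf mexico".split()
dropped = invalidate_by_score(cache, synopsis, lambda q, d: toy_scores[q])
```

With these toy scores, "oil spill" matches but scores below its cached minimum and is kept (a false negative if the entry was in fact stale), while "barack obama" is invalidated.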
  • 15. CIP Policies – Synopsis Generation
      • Full synopsis: entire document + all ranking attributes
      • Idea: shrink the synopsis by dropping content “unlikely” to affect scoring
        – Less communication, but more prediction errors
      • In this paper:
        – Transfer some fraction of the top TF-IDF terms
        – Drop document revisions that didn’t “change much”
      [Figure: the text "We the people of the United States, in Order to form a more perfect Union, . . ." reduced to the terms Order, People, Perfect, Union.]
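Keeping only a fraction of the top TF-IDF terms might look like the sketch below. The exact weighting used in the paper is not given on the slide, so the smoothed IDF formula, the function name `make_synopsis`, and the document-frequency table are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def make_synopsis(doc_terms, doc_freq, num_docs, fraction):
    """Keep the given fraction of a document's distinct terms, ranked by TF-IDF.

    doc_freq maps term -> number of documents containing it (assumed available).
    """
    tf = Counter(doc_terms)

    def tfidf(term):
        # Assumed weighting: raw term frequency times a smoothed log IDF.
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        return tf[term] * idf

    # Sort by descending weight; ties are broken alphabetically for determinism.
    ranked = sorted(tf, key=lambda term: (-tfidf(term), term))
    keep = math.ceil(fraction * len(ranked))
    return ranked[:keep]


doc = ["we", "the", "people", "of", "the", "union"]
freq = {"the": 99, "of": 95, "we": 60, "people": 20, "union": 5}
synopsis = make_synopsis(doc, freq, num_docs=100, fraction=0.5)
```

A fraction of 1 keeps every distinct term (the full-term synopsis), while smaller fractions trade communication cost for prediction accuracy.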
  • 16. Experimental Setting #1
      • Sandbox experiment: static cache containing a fixed query set, controlled document/query dynamics (no interleaving)
      • Data source: en.wikipedia.org
        – History span: 2006 – 2008
        – 2.8 TB, > 1M pages
        – Dominated by updates (> 95%)
      • Query source: Y! query log
        – 2 days of queries with a click on Wikipedia (2.5 M)
        – Sample of 10K queries (9234 unique) chosen uniformly at random
      • Evaluation pattern:
        – 120 single-day epochs (~4% change/day)
        – The same 10K query batch at the end of each epoch
      • Search library: Apache Lucene open-source library
  • 17. CIP Parameters – Notation Summary
      Ρ: fraction of top terms in synopsis (range 0 … 1)
      δ: revision modification threshold (range 0 … 1)
      1s: score thresholding applied? (0/1)
      τ: time-to-live (TTL) threshold (range 0 … ∞)
      Basic CIP: Ρ = 1, δ = 0, τ = ∞, 1s = 0
  • 18. Baseline Comparison
      Policy: False Positives / False Negatives / Stale Rate
      – No invalidation (TTL τ=∞): 0 / 0.108 / 0.768 (!)
      – No caching (TTL τ=0): 0.892 / 0 / 0
      – TTL τ=2: 0.446 / 0.054 / 0.055
      – TTL τ=5: 0.179 / 0.086 / 0.175
      – Basic CIP (full synopses, invalidate upon match, no score thresholding, τ=∞): 0.679 / 0.001 / 0.008 (!)
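The TTL baselines can be pictured as a cache whose entries expire after τ epochs. The sketch below is an assumption-laden illustration rather than the evaluated implementation: the class name `TTLCache` is hypothetical, and expiring an entry once its age reaches τ (rather than exceeds it) is a choice made so that τ = 0 behaves as "no caching" and τ = ∞ as "no invalidation".

```python
import math


class TTLCache:
    """Hypothetical results cache whose entries expire after `ttl` epochs."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.entries = {}  # query -> (results, epoch at which they were stored)

    def put(self, query, results, now):
        self.entries[query] = (results, now)

    def get(self, query, now):
        """Return cached results, or None if missing or expired."""
        entry = self.entries.get(query)
        if entry is None:
            return None
        results, stored_at = entry
        if now - stored_at >= self.ttl:
            # Expired: evict so the engine re-evaluates the query.
            del self.entries[query]
            return None
        return results


cache = TTLCache(ttl=2)
cache.put("oil spill", ["URL1", "URL2"], now=0)
fresh = cache.get("oil spill", now=1)    # still served from the cache
expired = cache.get("oil spill", now=2)  # age reached the TTL: a miss
```

A TTL policy ignores what actually changed in the index, which is why it trades false positives against stale results only through the single knob τ.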
  • 19. CIP Effectiveness: varying 1s, τ, and Ρ
      [Plot; annotations: shrinking synopsis, growing TTL]
  • 20. CIP Effectiveness: varying 1s and δ
      [Plot; annotation: increasing revision threshold]
  • 21. Best-in-Class Picture
      [Plot]
  • 22. Conclusions
      • The problem of maintaining cached search results over incremental indexes is real and under-explored
      • We proposed the CIP framework for real-time search cache management
      • We proposed an experimental setting for CIPs
      • We demonstrated a simple CIP that significantly improves over prior art (TTL), and measured its sensitivity to various parameters
  • 23. Future Work
      • Analyze a real-world scenario (News)
        – More drastic update and query dynamics
        – More realistic implementation to measure cost overhead
        – Compare to dynamic TTL
      • Continue improving CIPs
        – Better synopsis
        – Connections between corpus dynamics and query dynamics
      • Study the relation between real-time caching with CIPs and pre-fetching of results
  • 25. Policy Stability: Curbing Stale Results
      [Plot; annotations: still growing, stable, still growing but slowly, stable]