Top-k queries in real-time with
Cassandra and Intravert
Jonathan Halliday, JBoss
jonathan.halliday@redhat.com

Rui Vieira, Newcastle University
r.vieira2@newcastle.ac.uk
#CassandraEU
What is Top-k?

Top-k queries
• Rank matching results for the term(s)
– We don't really care about the scoring
algorithm

• Application: text search
– Documents containing the search words

• Application: log analysis
– Popular URLs in the time period
Yawn?
• SELECT document_id, score
FROM data
WHERE term='top-k'
ORDER BY score DESC, document_id
LIMIT 100
• Lunch time!
Not so fast...
• SELECT document_id, SUM(score) AS score
FROM data
WHERE term IN('top-k', 'algorithm')
GROUP BY document_id
ORDER BY score DESC, document_id
LIMIT 100
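Why the GROUP BY makes this harder: a minimal Python sketch, assuming per-document scores are combined by summation (the talk leaves the scoring algorithm open). The terms, document ids and scores are made-up illustration data. It shows that the best document overall need not be the best document for any single term, so merging per-term winners is not enough.

```python
from collections import defaultdict

# Hypothetical per-term scores: term -> {document_id: score}.
SCORES = {
    "top-k": {"a": 10.0, "b": 9.0},
    "algorithm": {"c": 10.0, "b": 9.0},
}

def local_winners(scores):
    """Best document for each term taken on its own."""
    return {term: max(per_doc, key=per_doc.get) for term, per_doc in scores.items()}

def overall_winner(scores):
    """Best document after summing its scores across all terms."""
    totals = defaultdict(float)
    for per_doc in scores.values():
        for doc_id, score in per_doc.items():
            totals[doc_id] += score
    return max(totals.items(), key=lambda kv: kv[1])

local = local_winners(SCORES)
overall = overall_winner(SCORES)
print(local, overall)
```

Document "b" wins overall but is second for every individual term, which is why the per-term lists have to be read deeper than k entries.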

Distributed Top-k
• We have a lot of data
• It's spread out
• We need to combine a subset efficiently
• Map/Reduce to the rescue!
– HiveQL, Stinger, Impala, Hawq

• Easy! But not fast
'real-time'
• Web pages, not control systems
• Performance, not Timeliness
• Pre-compute as much as possible
– scores for each term

• Assemble pre-computed fragments at
query time
– 'group by'
Naive method
foreach(term in searchTerms) {
SELECT ... FROM ... WHERE ...

}
• Handle group by in the application code
• Inefficient – transfers ALL the data for
each term, even low scores
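The naive method could look roughly like this in application code. This is a sketch only: `fetch_all` is a hypothetical stand-in for the per-term SELECT, the in-memory `DATA` dict replaces the real table, and summation is an assumed way of combining scores.

```python
import heapq
from collections import defaultdict

# Hypothetical stand-in for the table: term -> list of (document_id, score).
DATA = {
    "top-k": [("doc1", 5.0), ("doc2", 3.0), ("doc3", 0.1)],
    "algorithm": [("doc2", 4.0), ("doc3", 0.2), ("doc4", 0.1)],
}

def fetch_all(term):
    """Placeholder for the per-term SELECT; returns every row for the term."""
    return DATA.get(term, [])

def naive_top_k(search_terms, k):
    """Fetch everything for each term, then group and rank on the client."""
    totals = defaultdict(float)
    rows_transferred = 0
    for term in search_terms:
        rows = fetch_all(term)
        rows_transferred += len(rows)
        for doc_id, score in rows:
            totals[doc_id] += score
    top = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
    return top, rows_transferred

top, transferred = naive_top_k(["top-k", "algorithm"], 2)
print(top, transferred)
```

All six rows cross the network to produce two results; the low-scoring tail is transferred even though it cannot affect the answer.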
How much data is enough?
• Data is stored keyed (i.e. sorted) by
{ term, score DESC, doc_id }
or { time_period, score DESC, url }
• Partition keys IN the query params
– We can filter efficiently

• Can we range limit on score?
– Avoid going into the long tail
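A range limit on score can be pictured as reading only the head of a partition that is already sorted by score. The sketch below uses a plain Python list as a hypothetical partition; the document ids, scores and the `min_score` cut-off are illustration values, not figures from the talk.

```python
from itertools import takewhile

# Hypothetical partition for one term, already ordered by score descending,
# as rows keyed by { term, score DESC, doc_id } would be.
PARTITION = [
    ("doc7", 9.5),
    ("doc2", 8.0),
    ("doc9", 4.2),
    ("doc4", 0.3),
    ("doc1", 0.1),
]

def rows_at_or_above(partition, min_score):
    """Read from the head of the sorted partition, stopping at the first low score."""
    return list(takewhile(lambda row: row[1] >= min_score, partition))

head = rows_at_or_above(PARTITION, 4.0)
print(head)
```

Because the rows are sorted, the scan stops as soon as one score falls below the cut-off and never touches the long tail. The open question is how to pick a cut-off that is still guaranteed to give the right answer.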
Bring on the clever algorithms
• Smart People thought about this
problem already...
• ...but not in quite the same context
– WAN distributed logs from CDNs

• Identify, adapt and reuse existing
solutions
– faster and less risky than starting over
Inside a clever algorithm
• Fetch a little bit of data
• Look at it, decide how much more we
need
• Fetch some more
• Rinse and repeat
– but not too many times.

Desirable Characteristics
• Fixed number of communication rounds
is key
• Generality is good
– Cope with any distribution of data

• So is flexibility
– Tune for different use cases

Meet the candidates
Three-Phase Uniform Threshold (TPUT)
'Efficient Top-K Query Calculation in Distributed
Networks', Stanford/Princeton, 2004

Hybrid Threshold
'Efficient Processing of Distributed Top-k
Queries', UCSB, 2005

KLEE
'KLEE: a framework for distributed top-k query
algorithms', Max-Planck Institute, 2005
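As an illustration of the fixed-round idea, here is a simplified in-memory sketch of the three TPUT phases as the algorithm is commonly described. It is not the speakers' implementation: each peer is a plain dict, scores are assumed non-negative and combined by summation, and the names `tput` and `kth_largest` are made up for this example.

```python
import heapq
from collections import defaultdict

def kth_largest(values, k):
    """k-th largest value, or 0.0 when fewer than k values exist."""
    top = heapq.nlargest(k, values)
    return top[k - 1] if len(top) >= k else 0.0

def tput(peers, k):
    """peers: list of {item: non-negative score}. Returns the top-k by summed score."""
    m = len(peers)
    seen = defaultdict(dict)  # item -> {peer index: reported score}

    # Phase 1: every peer reports its local top-k; partial sums give a lower bound.
    for i, peer in enumerate(peers):
        for item, score in heapq.nlargest(k, peer.items(), key=lambda kv: kv[1]):
            seen[item][i] = score
    tau1 = kth_largest((sum(s.values()) for s in seen.values()), k)
    threshold = tau1 / m  # uniform per-peer threshold

    # Phase 2: every peer reports all items scoring at least the threshold.
    for i, peer in enumerate(peers):
        for item, score in peer.items():
            if score >= threshold:
                seen[item][i] = score
    tau2 = kth_largest((sum(s.values()) for s in seen.values()), k)
    # Any score a peer did not report is below the threshold, which bounds each total.
    candidates = [
        item for item, s in seen.items()
        if sum(s.values()) + (m - len(s)) * threshold >= tau2
    ]

    # Phase 3: look up exact scores for the surviving candidates only.
    exact = {item: sum(peer.get(item, 0.0) for peer in peers) for item in candidates}
    return sorted(exact.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

PEERS = [
    {"a": 10.0, "b": 6.0, "c": 1.0},
    {"b": 7.0, "c": 5.0, "d": 2.0},
    {"a": 1.0, "c": 6.0, "d": 3.0},
]
result = tput(PEERS, 2)
print(result)
```

In this toy data set item "d" is pruned after phase 2, so its remaining score is never fetched. In a Cassandra setting each phase would correspond to one round of queries against the partitions for the search terms.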
Implementation Issues
• Algorithms assume server side code
execution
• Limitations of CQL3 add some round
trips, increase network I/O
• Previous performance comparisons of
algorithms may no longer be valid

[Chart: Data Transfer vs. k]
[Chart: Execution Time vs. k]
[Chart: Execution Time vs. peers]
YMMV
• Test with your own data
• Test with your own hardware
• Hybrid Threshold for exact top-k
– Intravert optional

• KLEE for tunable approximate top-k
– Inefficient without Intravert
– Requires metadata
Intravert
• Cassandra++
– Embed and extend the existing server
– Based on Vert.x

• JSON over HTTP, REST API
– yup, Virgil did that already

• Multiple commands per call, chain
operations with REFs
Intravert
• Server side code execution
– Groovy (for now – Vert.x is polyglot)

• Filter result sets
• Write path triggers
– C* 2.0 has CASSANDRA-1311

• Run Groovy scripts on the server
– Easier than extending the Thrift API
Intravert
• Good trade-off between power and
operational complexity
• More complex development cycle
– Not easy to move code between client and
server

• Client not topology aware
– 'run x on each node' not possible
Back to the clever algorithms
• Intravert server side execution enables
cleaner, more efficient implementation
• Reduces network round trips
• Some dev and ops complexity increase
• Less complexity than custom server
deployment
– Reuse existing tools
Pre-aggregation
• For text search, can't predict common
term sets
• For time periods, can predict contiguous
periods
• Pre-calculate the rollups
– Hours, days, weeks, months
– Reduces number of terms (peers) to group
at query time
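A rollup for the log-analysis case might be computed along these lines. This is a sketch under stated assumptions: the hourly hit counts, URLs and the `rollup_daily` helper are invented for illustration, and a real deployment would write the daily rows back to a table rather than keep them in memory.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical hourly counts: (hour bucket, url) -> hits.
HOURLY = {
    (datetime(2013, 10, 17, 9), "/home"): 5,
    (datetime(2013, 10, 17, 10), "/home"): 7,
    (datetime(2013, 10, 17, 10), "/docs"): 4,
    (datetime(2013, 10, 18, 9), "/docs"): 2,
}

def rollup_daily(hourly):
    """Pre-aggregate hourly buckets into one counter per calendar day."""
    daily = defaultdict(Counter)
    for (hour, url), hits in hourly.items():
        daily[hour.date()][url] += hits
    return daily

daily = rollup_daily(HOURLY)
first_day = daily[datetime(2013, 10, 17).date()]
print(first_day.most_common(1))
```

A query over two full days then groups 2 daily peers instead of up to 48 hourly ones.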
Really clever algorithms
• Hierarchical node topology
– Map to Cassandra ring: same node may
own multiple keys (peers != nodes)

• Budget constrained approximate top-k
– Get as close as possible with the allowable
time and I/O constraints

• Fault tolerance
– Approximation given available nodes
Questions?
Or email us:
Jonathan Halliday, JBoss
jonathan.halliday@redhat.com

Rui Vieira, Newcastle University
r.vieira2@newcastle.ac.uk

C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert
