C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert

Top-k queries in real-time with
Cassandra and Intravert
Jonathan Halliday, JBoss
jonathan.halliday@redhat.com

Rui Vieira, Newcastle University
r.vieira2@newcastle.ac.uk
#CassandraEU

Top-k queries
• Rank matching results for the term(s)
– We don't really care about the scoring algorithm

• Application: text search
– Documents containing the search words

• Application: log analysis
– Popular URLs in the time period

Yawn?
• SELECT document_id, score
FROM data
WHERE term='top-k'
ORDER BY score DESC, document_id
LIMIT 100
• Lunch time!

Not so fast...
• SELECT document_id, score
FROM data
WHERE term IN('top-k', 'algorithm')
GROUP BY document_id
ORDER BY score DESC, document_id
LIMIT 100


Distributed Top-k
• We have a lot of data
• It's spread out
• We need to combine a subset efficiently
• Map/Reduce to the rescue!
– HiveQL, Stinger, Impala, Hawq

• Easy! But not fast

'real-time'
• Web pages, not control systems
• Performance, not Timeliness
• Pre-compute as much as possible
– scores for each term

• Assemble pre-computed fragments at query time
– 'group by'

Naive method
foreach (term in searchTerms) {
    SELECT ... FROM ... WHERE ...
}
• Handle group by in the application code
• Inefficient – transfers ALL the data for each term, even low scores
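
The naive method above can be sketched as follows. This is a minimal illustration, not the speakers' code: the in-memory `DATA` dict and the `fetch_all` helper are hypothetical stand-ins for one `SELECT` per term, and summing per-term scores is an assumed aggregation (the slides leave the scoring algorithm open).

```python
import heapq
from collections import defaultdict

# Hypothetical stand-in for the per-term table: term -> [(doc_id, score), ...].
DATA = {
    "top-k": [("d1", 90), ("d2", 50), ("d3", 10)],
    "algorithm": [("d2", 80), ("d3", 30), ("d4", 20)],
}

def fetch_all(term):
    """Stands in for one 'SELECT ... WHERE term = ?' round trip; returns every row."""
    return DATA.get(term, [])

def naive_top_k(search_terms, k):
    totals = defaultdict(int)
    rows_transferred = 0
    for term in search_terms:
        rows = fetch_all(term)
        rows_transferred += len(rows)
        for doc_id, score in rows:
            totals[doc_id] += score  # the 'group by', done in application code
    # Order by total score descending, then doc_id.
    top = heapq.nsmallest(k, totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return top, rows_transferred

top, transferred = naive_top_k(["top-k", "algorithm"], 2)
print(top, transferred)
```

Every row for every term crosses the network, so `rows_transferred` grows with the size of the long tail rather than with k.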

How much data is enough?
• Data is stored keyed (i.e. sorted) by
{ term, score DESC, doc_id }
or { time_period, score DESC, Url }
• Partition keys IN the query params
– We can filter efficiently

• Can we range limit on score?
– Avoid going into the long tail
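
A small sketch of why the sort order matters, assuming a single partition held as an in-memory list (the `PARTITION` data and `slice_above` helper are hypothetical). Because rows are ordered by descending score, everything at or above a score threshold is a prefix of the partition. In CQL3 this would roughly correspond to a range restriction on the score clustering column.

```python
from bisect import bisect_right

# Hypothetical partition for one term, kept sorted by (score DESC, doc_id),
# mirroring the { term, score DESC, doc_id } key layout.
PARTITION = [(95, "d7"), (80, "d2"), (80, "d9"), (41, "d1"), (7, "d4"), (3, "d5")]

def slice_above(partition, min_score):
    """Return the leading rows whose score is >= min_score.

    The rows are ordered by descending score, so the result is a prefix of
    the partition and the long tail is never read.
    """
    # Negate scores so bisect can search an ascending sequence.
    keys = [-score for score, _ in partition]
    end = bisect_right(keys, -min_score)
    return partition[:end]

print(slice_above(PARTITION, 40))
```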

Bring on the clever algorithms
• Smart people have already thought about this problem...
• ...but not in quite the same context
– WAN distributed logs from CDNs

• Identify, adapt and reuse existing solutions
– faster and less risky than starting over

Inside a clever algorithm
• Fetch a little bit of data
• Look at it, decide how much more we need
• Fetch some more
• Rinse and repeat
– but not too many times.


Desirable Characteristics
• Fixed number of communication rounds is key
• Generality is good
– Cope with any distribution of data

• So is flexibility
– Tune for different use cases


Meet the candidates
Three-Phase Uniform Threshold (TPUT)
'Efficient Top-K Query Calculation in Distributed Networks', Stanford/Princeton, 2004

Hybrid Threshold
'Efficient Processing of Distributed Top-k Queries', UCSB, 2005

KLEE
'KLEE: a framework for distributed top-k query algorithms', Max-Planck Institute, 2005
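
To make the "fixed number of rounds" idea concrete, here is a minimal in-memory sketch of the three TPUT phases. It is an illustration under stated assumptions, not the implementation benchmarked in this talk: each peer is a plain dict of item to score, scores are non-negative, the aggregate is a sum, and the function names are hypothetical. In a real deployment each phase would be one round of requests to the peers.

```python
import heapq
from collections import defaultdict

def kth_largest(values, k):
    """k-th largest value, or 0 if fewer than k values exist."""
    top = heapq.nlargest(k, values)
    return top[k - 1] if len(top) >= k else 0

def tput(peers, k):
    """peers: list of dicts mapping item -> non-negative score held by that peer."""
    m = len(peers)

    # Phase 1: each peer reports its local top-k; sum what was reported.
    partial = defaultdict(int)
    seen = defaultdict(set)  # item -> indexes of peers that have reported it
    for i, peer in enumerate(peers):
        for item, score in heapq.nlargest(k, peer.items(), key=lambda kv: kv[1]):
            partial[item] += score
            seen[item].add(i)
    threshold = kth_largest(partial.values(), k) / m  # the uniform threshold

    # Phase 2: each peer reports every item scoring >= threshold.
    for i, peer in enumerate(peers):
        for item, score in peer.items():
            if score >= threshold and i not in seen[item]:
                partial[item] += score
                seen[item].add(i)
    tau2 = kth_largest(partial.values(), k)
    # Any score not yet reported is below the threshold, which gives an
    # upper bound on each item's total; prune items that cannot reach tau2.
    candidates = [item for item in partial
                  if partial[item] + threshold * (m - len(seen[item])) >= tau2]

    # Phase 3: look up exact scores for the surviving candidates only.
    exact = {item: sum(peer.get(item, 0) for peer in peers) for item in candidates}
    return heapq.nsmallest(k, exact.items(), key=lambda kv: (-kv[1], kv[0]))

peers = [
    {"a": 50, "b": 30, "c": 20, "d": 5},
    {"b": 40, "c": 35, "a": 10, "e": 2},
    {"c": 45, "d": 25, "b": 5, "a": 1},
]
print(tput(peers, 2))
```

In this example item "e" never leaves its peer: its score is below the threshold, so it is not reported in phase 2 and is never looked up in phase 3.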

Implementation Issues
• Algorithms assume server side code execution
• Limitations of CQL3 add some round trips, increase network I/O
• Previous performance comparisons of algorithms may no longer be valid


[Chart: Data Transfer vs. k]

[Chart: Execution Time vs. k]

[Chart: Execution Time vs. peers]

YMMV
• Test with your own data
• Test with your own hardware
• Hybrid Threshold for exact top-k
– Intravert optional

• KLEE for tunable approximate top-k
– Inefficient without Intravert
– Requires metadata

Intravert
• Cassandra++
– Embed and extend the existing server
– Based on Vert.x

• JSON over HTTP, REST API
– yup, Virgil did that already

• Multiple commands per call, chain operations with REFs

Intravert
• Server side code execution
– Groovy (for now – Vert.x is polyglot)

• Filter result sets
• Write path triggers
– C* 2.0 has CASSANDRA-1311

• Run Groovy scripts on the server
– Easier than extending the Thrift API

Intravert
• Good trade-off between power and operational complexity
• More complex development cycle
– Not easy to move code between client and server

• Client not topology aware
– 'run x on each node' not possible

Back to the clever algorithms
• Intravert server side execution enables cleaner, more efficient implementation
• Reduces network round trips
• Some dev and ops complexity increase
• Less complexity than custom server deployment
– Reuse existing tools

Pre-aggregation
• For text search, can't predict common term sets
• For time periods, can predict contiguous periods
• Pre-calculate the rollups
– Hours, days, weeks, months
– Reduces number of terms (peers) to group at query time
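
A rough sketch of the rollup idea, assuming per-hour URL hit counts held as `Counter` objects. The `rollup` and `cover` helpers are hypothetical names for illustration: the first pre-computes coarser buckets at write time, the second shows how a query range can be covered with fewer pieces when an aligned rollup exists.

```python
from collections import Counter

def rollup(buckets, group_size):
    """Merge each run of group_size consecutive buckets into one Counter."""
    return [sum(buckets[i:i + group_size], Counter())
            for i in range(0, len(buckets), group_size)]

def cover(start, end, group_size):
    """Split [start, end) into (level, index) pieces, using rollups where aligned."""
    pieces = []
    pos = start
    while pos < end:
        if pos % group_size == 0 and pos + group_size <= end:
            pieces.append(("rollup", pos // group_size))
            pos += group_size
        else:
            pieces.append(("base", pos))
            pos += 1
    return pieces

hourly = [Counter({"/a": 1}), Counter({"/a": 2, "/b": 1}),
          Counter({"/b": 5}), Counter({"/c": 1})]
print(rollup(hourly, 2))
# Hours 22..48 with daily rollups: 4 pieces to group instead of 27.
print(cover(22, 49, 24))
```

Each piece returned by `cover` plays the role of one peer in the top-k algorithms, so fewer pieces means fewer lists to merge at query time.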

Really clever algorithms
• Hierarchical node topology
– Map to Cassandra ring: same node may own multiple keys (peers != nodes)

• Budget constrained approximate top-k
– Get as close as possible with the allowable time and I/O constraints

• Fault tolerance
– Approximation given available nodes

Questions?
Or email us:
Jonathan Halliday, JBoss
jonathan.halliday@redhat.com

Rui Vieira, Newcastle University
r.vieira2@newcastle.ac.uk
