1. This paper employs CCIndex to support multi-dimensional range queries, overcoming the limitations of Cassandra. The results show that CCIndex gains 2.4 times the performance of Cassandra's index scheme at 1% selectivity, and about 3.7 times the performance when the selectivity is 50%, for 2 million records.
2. This paper shows that CCIndex is a general approach for DOTs, which gains better performance for DOTs with slow random read and fast sequential read. This paper shows that CCIndex improves query performance by about 2 times on DOTs with fast random read, and achieves an order of magnitude performance improvement for DOTs whose random read is significantly slower than sequential read or scan, such as HBase. This paper implements the CCIndex recovery mechanism and shows that the efficiency of CCIndex recovery is 33% of that of sequential write for Cassandra.
3. This paper reveals that Cassandra is optimized for hash tables rather than ordered tables. Cassandra provides both consistent hashing and order-preserving hashing, but the read and scan operations are not optimized for order-preserving hashing, for example by prefetching for reads or by optimizing scans for range queries over ordered tables. Cassandra's strategy is good for hash tables, but inefficient for ordered tables.
This paper is organized as follows. Section 2 gives the background. Section 3 illustrates the design and implementation of CCIndex in Cassandra. Section 4 shows the experimental results and the discussion of the results. Section 5 concludes the whole work.

II. BACKGROUND

A. CCIndex Analysis
CCIndex is proposed to support multi-dimensional range queries over DOTs by reorganizing data. CCIndex introduces a ComplementalTable for each index column. A ComplementalTable stores all columns except the rowkey and the corresponding index column. The ComplementalTable rowkey is a concatenation of the index column value, the original rowkey, and the length of the index column value. This way of generating the ComplementalTable rowkey ensures that all rowkeys are unique and sorted first by index column and then by the original rowkey. The OriginalTable and the ComplementalTables are called Complemental Clustering Index Tables (CCITs). CCITs set the replica factor to 1 to decrease the storage overhead. CCIndex maintains the reliability of a CCIT through the other CCITs, and introduces a replicated CCT (Complemental Check Table) for each CCIT to help data recovery.

Fig. 1 Data layout of CCIndex.

In Fig. 1, there is an OriginalTable (CCIT0) with a primary id and two index columns, weight and height. CCIT-W and CCIT-H (ComplementalTables) are ordered by key1 and key2 respectively. With these CCITs, range queries over id, weight, or height can be converted to range queries on CCIT0, CCIT-W, or CCIT-H.
A CCT stores the rowkey and all index columns of a CCIT. CCTs are replicated while the CCITs are not.
CCIndex creates all ComplementalTables and CCTs when the OriginalTable is created. CCIndex maintains the index through the procedures of inserting and deleting.
The procedure of writing is shown in Fig. 2. When writing a record into the OriginalTable, CCIndex reads the OriginalTable by rowkey to get the old values, checks whether the index values are going to be modified, and deletes records from the corresponding CCITs and CCTs when updating index values. After that, CCIndex writes the records to all CCITs and CCTs. When deleting a record, CCIndex reads all index values from the OriginalTable and deletes records from all CCITs and CCTs.

Fig. 2 The procedure of writing.

The procedure of multi-dimensional range queries is shown in Fig. 3. CCIndex estimates the result size for each query condition and selects the condition with the smallest result size to execute a range query on the corresponding CCIT. CCIndex employs the other conditions to filter the result of the range query and returns the ultimate results of the multi-dimensional range query.
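The ComplementalTable rowkey scheme described in Section II.A (index column value, then original rowkey, then the length of the index value) can be sketched in Java. This is a minimal illustration, not the paper's code: the fixed 4-digit length suffix and the helper names `buildKey`/`splitKey` are assumptions, and the sort-order argument holds as stated when index values have a fixed width, as in the later experiments (10-byte keys).

```java
public class CcIndexRowKey {
    // Build a ComplementalTable rowkey: index value + original rowkey +
    // length of the index value. The trailing length field lets the
    // recovery path split the key back into its two parts.
    static String buildKey(String indexValue, String originalKey) {
        return indexValue + originalKey + String.format("%04d", indexValue.length());
    }

    // Split a ComplementalTable rowkey back into {indexValue, originalKey},
    // as the recovery procedure in Section III.D requires.
    static String[] splitKey(String key) {
        int len = Integer.parseInt(key.substring(key.length() - 4));
        return new String[] {
            key.substring(0, len),               // index column value
            key.substring(len, key.length() - 4) // original rowkey
        };
    }

    public static void main(String[] args) {
        // weight = 75 (zero-padded), original rowkey "row0000001"
        String k = buildKey("0000000075", "row0000001");
        String[] parts = splitKey(k);
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

With fixed-width index values, plain lexicographic order on these keys sorts records by index value and breaks ties by the original rowkey, which is exactly the property the ComplementalTable relies on.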
Fig. 3 The procedure of multi-dimensional range queries.

CCIndex for HBase uses a simple way to estimate the result size. In HBase, HMaster stores the region-to-server mapping information as in Fig. 4. The mapping information can be described as a set of <startKey, regionServer> pairs ordered by startKey. CCIndex finds the regions covered by each range query and estimates the result size by the number of regions. When HBase has more than one region and a maximum region size Smax, each region size must be greater than Smax/2 and less than Smax. Thus CCIndex considers the result size to depend on the number of regions covered.

Fig. 4 The region-to-server mapping of HBase.

In HBase, the speed of scan is 8.2 times that of random read, and the speed of multi-dimensional range queries on CCIndex is 11.4 times that of IndexedTable. The performance of CCIndex is affected by two issues:
• The accuracy of result size estimation. The more accurate the estimation is, the fewer unnecessary records will be scanned.
• The speed ratio of range query to random read. To execute a multi-dimensional range query, CCIndex executes a range query on a CCIT and then filters the result. IndexedTable executes a range query on an index table to get the original rowkeys, and then gets the records by random reads on those rowkeys. Thus the speed ratio of CCIndex to IndexedTable is determined by the speed ratio of range query to random read.

B. Cassandra Analysis
Cassandra organizes nodes as a ring overlay like Chord to partition data. Each node manages a part of the data in the ring, with data ids ranging from the previous node's token to this node's token. Records use the same partitioner to map their keys to the token ring. The corresponding node writes records to the commitlog and then to its memtable.
A memtable is a memory structure containing sorted rows. A memtable is flushed to an SSTable on disk when it is full. SSTables are sorted structures flushed one by one that cannot be modified once flushed, so records across multiple SSTables are not sorted, as in Fig. 5. Cassandra combines several old SSTables into a new SSTable by compaction to reduce the number of SSTables. Each node contains more than one SSTable in most cases.

Fig. 5 An example of memtable and SSTables in a node.

Like Dynamo, Cassandra keeps strong consistency if W + R > N, where W and R denote respectively the minimum numbers of nodes that must have executed a write or read operation successfully, and N is the replication factor. Cassandra uses different ConsistencyLevels to keep the balance between consistency and availability. In writing, ConsistencyLevel.ONE and QUORUM ensure that the write operation has been executed successfully on at least 1 and N/2 + 1 node(s) respectively. In reading, ONE returns the record responded by the fastest node, and QUORUM returns the most recent record among the replies from at least N/2 + 1 nodes. Compared with ONE, QUORUM has higher latency while maintaining consistency.
Cassandra version 0.7+ provides APIs to execute multi-dimensional range queries, but with the limitation that the APIs require at least one equality operator on a configured index column in the query expression. Cassandra also provides APIs to execute range queries over the rowkey, but the speed of a range query is only 1.3 times that of a random read.
In summary, there are three mismatches between HBase and Cassandra, which impose challenges when utilizing CCIndex for Cassandra.
1) The smallest sorted unit is the region in HBase while it is the node in Cassandra: In HBase, regions are sorted by the rowkey of records. In Cassandra, records are stored in SSTables and sorted between nodes, while multiple SSTables in the same node are not sorted. This difference decreases the accuracy of estimating the result size.
2) The speed of range query: Cassandra executes a range query as a logical scan, traversing all SSTables to find the 'next' record, while HBase executes physical scans on regions.
3) The differences between the HBase and Cassandra APIs: To implement CCIndex for Cassandra, the API issue must be considered, namely how to utilize the different APIs given by HBase and Cassandra and unify the APIs that CCIndex provides to the application level.

III. DESIGN AND IMPLEMENTATION
CCIndex for Cassandra uses different methods to deal with these differences.

A. The smallest sorted unit issue
As record sizes between nodes might be unbalanced, the way CCIndex for HBase estimates the result size, by the number of covered regions, cannot work on Cassandra. This paper uses a different way to estimate the result size, which relies on the data distribution information of Cassandra.
1) Data distribution information gathering: CCIndex for Cassandra first adds an API in CassandraClient to gather the SSTable information of a certain node, and then adds a daemon thread Listener in CassandraDaemon. Listener gets the token ring information from StorageService every other minute. With the token-IP mapping, Listener uses the API above to get the SSTable information from every node. Thus each node saves the data distribution information of all nodes. The Cassandra kernel code is modified without performance degradation.
2) The estimation of result size: The CCIndex client uses a thread Refiner to get the data distribution information and token ring information from Listener; then CCIndex estimates the result size for every query condition:
• Calculate the nodes covered by the range, and count the node number as N3;
• For every node covered, read the total SSTable data file size S and the file number C;
• Sum S and C over all covered nodes to get N1 and N2.
Each search condition thus has a tuple [N1, N2, N3]. N1 has higher priority than N2, and N2 has higher priority than N3. CCIndex for Cassandra executes the range query on the CCIT whose condition has the smallest tuple.

B. The speed of range query
The speed of range query is determined by the Cassandra system. The aim of CCIndex for Cassandra is to implement CCIndex while making as few changes as possible. The low speed of range query affects the speed of multi-dimensional range queries but does not restrict the implementation.

C. The API issue
CCIndex encapsulates the APIs of HBase and Cassandra, and exposes the same CCIndex APIs to applications.

D. Data recovery
CCIndex introduces the replicated CCT to help recover damaged data. This paper implements the data recovery module with CCT in Cassandra.
To recover a record of the OriginalTable, CCIndex first reads the CCTs by rowkey to get all index columns. Then CCIndex concatenates the original rowkey and the index column value to form the rowkey of a certain ComplementalTable. CCIndex tries to read the record by the concatenated rowkey and writes the corresponding record into the OriginalTable. If the recovery fails, CCIndex tries to recover the data from another ComplementalTable.
To recover a record of a ComplementalTable, CCIndex gets the rowkey of the OriginalTable by splitting the given rowkey. Then CCIndex tries to read the record from the OriginalTable. If the read operation fails, CCIndex uses the other index column values obtained from the CCT to recover the data from other ComplementalTables.
To recover a certain range of a table, CCIndex scans the corresponding CCT and uses the methods above to recover the records one by one. A range can be split into several parts for multi-threaded recovery to increase efficiency.

E. Implementation
The CCIndex for Cassandra prototype uses Cassandra v0.7.2 as its code base and is written in Java.
As the replica factor of Cassandra is associated with the keyspace, it is easy for CCIndex for Cassandra to replicate CCTs by putting them into a separate keyspace with replica factor 3. CCIndex sets the keyspace replica factor to 1 for the CCITs, and creates one ComplementalTable for each index column.

Fig. 6 The architecture of CCIndex for Cassandra.

The CCIndex for Cassandra client connects with a server node to perform operations like inserting, reading, and range query. As Fig. 6 shows, CCIndex for Cassandra uses a connection
pool extended from Pelops [11]. The connection pool assigns a random connection to each client to avoid hot-spot issues.
The client gets the token ring and data distribution information by sending a query to a certain node to estimate the query result size.

IV. EVALUATION
CCIndex for Cassandra is implemented and evaluated through analysis and experiments.

A. Space Overhead Analysis
For the given metrics, the performance is easy to evaluate through experiments; for the space overhead, theoretical analysis is more suitable.
Here we denote the number of index columns by N, the replica factor of Original Cassandra and CCT by R, the average length of the key and all index columns by Ls, and the total length of a record by L.
In Original Cassandra, the space for every record is:
S_ORG = L * R (1)
In CCIndex, the space for each record is that of the CCITs plus the CCTs. The space for the CCITs is:
S_CCIT = L * (N + 1) (2)
The space for the CCTs is:
S_CCT = Ls * (N + 1) * R (3)
The total space for CCIndex is:
S_CC = S_CCIT + S_CCT = (N + 1) * (L + Ls * R) (4)
The space overhead ratio of CCIndex to Original Cassandra is:
S_CC / S_ORG - 1 = (N + 1) / R + (N + 1) * Ls / L - 1 (5)
In Cassandra, the replica number R is often set to 3. The ratio is then:
(N + 1) / 3 + (N + 1) * Ls / L - 1 (6)
Equation (6) can be plotted as Fig. 7.

Fig. 7 The space overhead ratio of CCIndex to Original Cassandra, using L/Ls values as the horizontal axis.

From Fig. 7, the overhead ratio drops significantly as Ls/L decreases and N decreases, which indicates that to avoid huge space overhead, there should be fewer index columns in CCIndex and the data length of the index columns should be shorter. When N is smaller than 2, CCIndex does not have enough replicas for the CCITs. When N changes from 2 to 4 and Ls/L changes from 1/30 to 1/10, the overhead ratio changes from 10% to 116.7%.

B. Experiment Setup
This paper introduces a benchmark to evaluate the throughput of basic operations, including sequential read/write, random read, and range query. The workload uses a table with columns rowkey, index1, index2, index3, and data. The lengths of rowkey, index1, index2, and index3 are 10 bytes, while the data column is 1 KB. The throughput is defined as rows per second over all clients.
CCIndex builds indexes for index1, index2, and index3; the ConsistencyLevel for CCITs is ONE, and for CCTs it is QUORUM.
Original Cassandra and Cassandra Indexed set the replica factor to 3 and the ConsistencyLevel to QUORUM. Original Cassandra does not build indexes. Cassandra Indexed builds indexes for index1, index2, and index3.
The experimental cluster has 5 nodes. Each node has two 1.8 GHz dual-core AMD Opteron(tm) 270 processors, 4 GB of memory, and 321 GB RAID5 SCSI disks. All nodes are connected by Gigabit Ethernet. Each node uses Red Hat CentOS release 5.3 (kernel 2.6.18), the ext3 file system, and Sun JDK 1.6.0_14. The test runs on another client machine, which has a 2.0 GHz dual-core Intel(R) Core(TM) Duo T5750 processor, 3 GB of memory, and a Broadcom NetLink(TM) Fast Ethernet 100 Mbps adapter. The client uses Ubuntu 10.04 LTS, the ext3 file system, and Sun JDK 1.6.0_14.
The workload in the experiments has 2 million rows; the token of each node is initialized manually to keep load balance. Each test runs three times to report the average value. The client uses 25 concurrent threads for sequential write, sequential read, random read, and range query, and 1 thread for multi-dimensional range queries.

C. Experiment Result
The result in Fig. 8 shows that the ConsistencyLevel has a great effect on every test, which can be confirmed by the great differences between the throughput of Cassandra(1) and Cassandra(3), or Cassandra Indexed(1) and Cassandra Indexed(3).
The throughput of sequential write for CCIndex is significantly lower than that of Cassandra Indexed and much lower than that of Original Cassandra, because maintaining the index needs an extra random read to get the row data from the OriginalTable, and if there are old index column values, further delete operations are needed to update the index.
The performance of Original Cassandra(3) and Cassandra Indexed(3) on range query, random read, and sequential read is nearly identical due to the same implementation. They are lower than CCIndex because of the ConsistencyLevel, which can be confirmed by the fact that Original Cassandra(1) and Cassandra Indexed(1) have nearly the same throughput as CCIndex.
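The overhead endpoints quoted in the space overhead analysis follow directly from Eq. (5). As a quick check, a small Java sketch (an illustrative helper, not part of the prototype) evaluates the ratio:

```java
public class SpaceOverhead {
    // Overhead ratio from Eq. (5): Scc/Sorg - 1 = (N+1)/R + (N+1)*Ls/L - 1,
    // where n is the number of index columns, r the replica factor,
    // and lsOverL the ratio Ls/L.
    static double overhead(int n, int r, double lsOverL) {
        return (n + 1.0) / r + (n + 1.0) * lsOverL - 1.0;
    }

    public static void main(String[] args) {
        // The two endpoints quoted in the analysis: about 10% and 116.7%.
        System.out.printf("N=2, Ls/L=1/30: %.1f%%%n", 100 * overhead(2, 3, 1.0 / 30));
        System.out.printf("N=4, Ls/L=1/10: %.1f%%%n", 100 * overhead(4, 3, 1.0 / 10));
        // With the three index columns of the experiment and Ls/L = 1/30,
        // the formula gives about 46.7%, matching the reported ~46% figure.
        System.out.printf("N=3, Ls/L=1/30: %.1f%%%n", 100 * overhead(3, 3, 1.0 / 30));
    }
}
```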
Fig. 8 Basic operations for Original Cassandra, Cassandra Indexed, and CCIndex. Cassandra(1) is Cassandra with 1 replica and ConsistencyLevel ONE. Cassandra(3) is Cassandra with 3 replicas and ConsistencyLevel QUORUM. Cassandra Indexed builds indexes for the index columns.

In this experiment, N is 3 and Ls/L is 1/30, so CCIndex uses 46% more space than Original Cassandra(3) in theory. The result shows that Original Cassandra(3) uses 1.39 GB per node while CCIndex uses 2.12 GB per node, a 52.6% space overhead. Because there are memtables not yet flushed from memory, we consider the storage overhead to confirm the theoretical analysis.
The tests of multi-dimensional range queries write records in which index1 and index2 are randomly generated from 0 to 2 million and index3 is randomly generated from 0 to MAXVALUE. In this way, the tests can use the expression 0 < index1 < 2000000 and 0 < index2 < 2000000 and index3 = 0 to match the requirement of the Cassandra API. The MAXVALUE of index3 is set from 100 down to 1 to change the selectivity from 1% to 100%.
The results of the multi-dimensional range query tests under different conditions are shown in Fig. 9. When the selectivity is under 10%, Cassandra Indexed performs well, but when the selectivity rises from 20% to 100%, the latency increases significantly.

Fig. 9 Throughput of multi-dimensional range queries by CCIndex, Cassandra Indexed(1), and Cassandra Indexed(3).

The throughput ratio of CCIndex to Cassandra Indexed(3) is at least 2.4. When the selectivity grows, the throughput of CCIndex increases to 3.7 times that of Cassandra Indexed(3). In the experiment, CCIndex is about 1.8 to 2.7 times as fast as Cassandra Indexed(1).
In another test on Cassandra Indexed, when MAXVALUE is 100 and the query expression is 0 < index1 < 10000, 0 < index2 < 10000 and index3 = 0, an exception happens every time in all 10 attempts, while CCIndex performs well. We consider that it happens when many records are discarded by the non-equality column ranges.
The throughput of recovery is 1819 records/s on average in Fig. 10. To recover one record, CCIndex first executes a range query on the CCT, then writes on the CCIT and random reads on the CCIT. The CCT range query speed is 6013 records/s, while the write speed on the CCIT is 4778 records/s and the random read speed on the CCIT is 4797 records/s. The recovery speed is 1964.7 records/s in theory. Compared with 1819 records/s in practice, the recovery speed matches the theoretical analysis.

Fig. 10 CCIndex recovery speed.

D. Discussion
The results provide many insights on CCIndex and Cassandra.
1) Overall, the results show that CCIndex is a general approach for DOTs, successful in improving both performance and query expressiveness.
2) The results show that in Cassandra, sequential read and random read have the same throughput, and range query throughput is only 1.3 times as fast as random read. But if a client sets Cassandra's partitioner to OrderedPartitioner, it suggests that the client is probably willing to use special operations on ordered tables such as sequential read and range query. Cassandra could do some optimization such as prefetching and caching of adjacent records.
3) CCIndex is suitable for tables with 2 to 4 index columns. CCIndex cannot guarantee reliability with fewer than 2 index columns because the CCITs are not replicated. If there are more than 4 index columns, the space overhead is more than 2 times that of Original Cassandra. When a table has more than 4 columns with query requirements, a solution is to build indexes for the 2 to 4 most frequently used columns, and to filter the result by the non-indexed conditions in applications.
4) The throughput of CCIndex is determined by the speed ratio of range query to random read. This explains why the throughput of CCIndex for Cassandra is 2.4 to 3.7 times that of Cassandra
Indexed(3), while the throughput of CCIndex for HBase is 11.4 times that of IndexedTable. CCIndex converts random reads on the OriginalTable into a range query on a CCIT, so its performance is associated with the speed improvement from random read to range query.
During the procedure of a multi-dimensional range query, IndexedTable executes a range query and a random read for every record before filtering, while CCIndex only needs to execute the range query once.
We denote the speed of range query by Ss and the speed of random read by Sr.
The speed for CCIndex to get records is:
Scc = Ss (7)
The speed for IndexedTable is:
Si = 1 / (1/Ss + 1/Sr) = Ss * Sr / (Ss + Sr) (8)
The ratio of CCIndex to IndexedTable is:
Scc / Si = (Ss + Sr) / Sr = 1 + Ss / Sr (9)
So the ratio of CCIndex to IndexedTable is decided by the value of Ss/Sr. For HBase, Ss/Sr is equal to 8.2 and Scc/Si is equal to 9.2. As there is no optimization of the query, IndexedTable filters more records as candidate results, so the final ratio of CCIndex to IndexedTable on multi-dimensional range queries, 11.4, meets the analysis.
From Fig. 9, the throughput of CCIndex is 1.9 and 2.4 times that of Cassandra Indexed(1) and Cassandra Indexed(3) respectively. CCIndex performs the same as Cassandra Indexed(1) in random read and scan. From Fig. 8, Ss/Sr is equal to 1.2 on Cassandra Indexed(1), and as CCIndex takes more time to filter the result, the final ratio 1.9 is close to the predicted value 2.2.

V. CONCLUSIONS
Cassandra is a Distributed Ordered Table supporting multi-dimensional range queries. However, the current design and implementation of Cassandra have two problems: (1) Cassandra's query expression is limited in that there must be one dimension with an equality operator in the query expression; (2) the performance is poor. With the success of the CCIndex scheme in Apache HBase, this paper studies the feasibility of employing CCIndex to improve multi-dimensional range queries in DOTs like Cassandra.
There are three mismatches between HBase and Cassandra when utilizing CCIndex for Cassandra, which impose challenges: (1) the smallest sorted unit is the region in HBase while it is the node in Cassandra, so the estimation method in HBase is not suitable for Cassandra; (2) the speed of range query in Cassandra is not fast enough to accelerate CCIndex performance; (3) the APIs of HBase and Cassandra are different.
This paper proposes a new approach to estimate the result size and exposes the same CCIndex APIs to applications to tackle the first and the third mismatch. The speed of range query is determined by the Cassandra system; Cassandra could do some optimization such as prefetching and caching of adjacent records.
The experimental results show that CCIndex gains 2.4 to 3.7 times the performance of Cassandra's index scheme with 1% to 50% selectivity for 2 million records. This paper shows that CCIndex is a general approach for DOTs, and could gain better performance on multi-dimensional range queries for DOTs with slow random read and fast sequential read. This paper implements the CCIndex recovery mechanism and shows that CCIndex recovery performance is 33% of that of sequential write in Cassandra. This paper reveals that Cassandra is optimized for hash tables rather than ordered tables in read and range queries; Cassandra could do some optimization such as prefetching and caching of adjacent records.

ACKNOWLEDGMENT
This work is supported in part by the Hi-Tech Research and Development (863) Program of China (Grant No. 2006AA01A106), and the major national science and technology special projects (2010ZX03004-003-03).

REFERENCES
[1] Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, and Raghu Ramakrishnan, "Efficient bulk insertion into a distributed ordered table," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008.
[2] Ymir Vigfusson, Adam Silberstein, Brian F. Cooper, and Rodrigo Fonseca, "Adaptively parallelizing distributed range queries," Proc. VLDB Endow., vol. 2, pp. 682-693, 2009.
[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, "Bigtable: a distributed storage system for structured data," in 7th USENIX Symposium on Operating Systems Design and Implementation, 2006.
[4] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni, "PNUTS: Yahoo!'s hosted data serving platform," Proc. VLDB Endow., vol. 1, pp. 1277-1288, 2008.
[5] Apache HBase project. [Online]. Available: http://hbase.apache.org/
[6] Hai Zhuge, "Probabilistic Resource Space Model for Managing Resources in Cyber-Physical Society," IEEE Transactions on Services Computing, 2011.
[7] Yongqiang Zou, Jia Liu, Shicai Wang, Li Zha, and Zhiwei Xu, "CCIndex: a Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries," in 7th IFIP International Conference on Network and Parallel Computing, 2010.
[8] Avinash Lakshman and Prashant Malik, "Cassandra: a decentralized structured storage system," SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35-40, Apr. 2010.
[9] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels, "Dynamo: Amazon's highly available key-value store," in Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles, 2007.
[10] Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan, "Chord: a scalable peer-to-peer lookup service for internet applications," in Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, 2001.
[11] Pelops project. [Online]. Available: https://github.com/s7/scale7-pelops