Cassandra by example - the path of read and write requests
This article describes how Cassandra handles and processes requests. It will help you to get a better impression about Cassandra's internals and architecture. The path of a single read request as well as the path of a single write request will be described in detail.

Published in: Technology

Abstract

This article describes how Cassandra handles and processes requests. It will help you get a better impression of Cassandra's internals and architecture. The path of a single read request as well as the path of a single write request is described in detail. The description is based on a single data center Cassandra V1.1.4 cluster (default store configuration).

Example data model

Please note that this article is not an introduction to the Cassandra data model. The examples below use a column family hotel. In short, a column family is analogous to a table in the relational database approach. Each hotel record, or row, is identified by a unique key. The columns of a hotel row include the hotel name as well as the category of the hotel.

The column family hotel lives inside the keyspace book_a_hotel. By analogy, a keyspace can be described as a tablespace or database.

Thrift

The common way to access Cassandra is Thrift. Thrift is a language-independent RPC protocol originally developed at Facebook and contributed to Apache. Although Thrift is supported by most popular programming languages, the Cassandra project suggests using higher-level Cassandra clients such as Hector or Astyanax instead of the raw Thrift-based API. In general, these high-level clients try to hide the underlying middleware protocol.

Gregor Roth, Cassandra by example - the path of read and write requests
The listing below shows a simple query using the Hector client library V1.1.

    import me.prettyprint.cassandra.serializers.AsciiSerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.ColumnSlice;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.QueryResult;
    import me.prettyprint.hector.api.query.SliceQuery;

    // [1] prepare the client (cluster);
    //     the second argument is a comma-separated list of node addresses
    Cluster cluster = HFactory.getOrCreateCluster("TestClstr", ",,");
    Keyspace keyspaceOperator = HFactory.createKeyspace("book_a_hotel", cluster);

    // [2] create the query (fetching the column category)
    SliceQuery<String, String, String> query = HFactory.createSliceQuery(keyspaceOperator,
            AsciiSerializer.get(), StringSerializer.get(), StringSerializer.get());
    query.setColumnFamily("hotel");
    query.setKey("26813445");
    query.setColumnNames("category");

    // [3] perform the request
    QueryResult<ColumnSlice<String, String>> result = query.execute();
    ColumnSlice<String, String> row = result.get();
    String category = row.getColumnByName("category").getValue();
    //...

    // [4] release the client (cluster)
    cluster.getConnectionManager().shutdown();

In the first line of the listing, a set of server IP addresses is passed when the Hector Cluster object is created. Each server address identifies a single Cassandra node. A collection of independent Cassandra nodes (the Cassandra cluster) represents the Cassandra database. Within this cluster all nodes are peers; no master node or anything similar exists.

The client is free to connect to any Cassandra node to perform any request. In the listing above, 3 addresses are configured. This does not mean that the Cassandra cluster consists of 3 nodes. It just defines that the client will communicate with these nodes only.

The connected Cassandra node potentially plays two roles. In every case the connected node is the coordinator node, which is responsible for handling the request. In addition, the connected node is a replica node if it is responsible for storing a replica of the requested data. For instance, the requested Pavillon Nation hotel record of the example above does not have to be stored on the connected node. Often the coordinator node has to send sub-requests to other replica nodes to be able to handle the request. As shown in the diagram below, the nodes [...] would not be able to serve a Pavillon Nation query directly without sending sub-requests to other nodes.
Please note that coordinator node and replica node are role descriptions of a Cassandra node in the context of a particular read or write operation. Every Cassandra node can act as a coordinator node as well as a replica node.

Hector uses a round-robin strategy to select the node to use. When executing the example query, Hector first connects to one of the configured nodes. The connect request is handled on the server side by the CassandraServer.

By default the CassandraServer is bound to server port 9160 during the start sequence of a Cassandra node. The CassandraServer implements Cassandra's Thrift interface, which defines remote procedure methods such as set_keyspace(…) or get_slice(…). This means Cassandra's Thrift interface is implicitly stateful. The Hector client has to call the remote method set_keyspace(…) first to assign the keyspace book_a_hotel to the current connection session. After the keyspace has been assigned, get_slice(…) can be called to request the columns of the Pavillon Nation hotel.

However, you are not forced to use Thrift to access Cassandra. Several alternative open-source connectors, such as REST-based connectors, exist.

Determining the replica nodes

The CassandraServer is responsible for handling the client-server communication only. Internally, the CassandraServer calls the local StorageProxy class to process the request. The StorageProxy implements the coordinator logic. The coordinator logic includes determining the replica nodes for the requested row key as well as calling these replica nodes.

By default a RandomPartitioner is used to determine the replica nodes for the row key of the request. The RandomPartitioner spreads the data records (rows) evenly across the Cassandra nodes, which are arranged in a circular ring. Within this ring each node is assigned a range of hash values (tokens). To determine the first replica, the MD5 hash of the row key is calculated and the node whose assigned token range contains the key hash is selected.
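The lookup of the first replica can be illustrated with a small, self-contained sketch. It is not Cassandra's actual implementation: the class TokenRingSketch, its methods and the node names are hypothetical, and it assumes that a key's token is the absolute value of its MD5 hash read as a big integer and that each node owns the range ending at its own token.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a token ring; not part of Cassandra.
final class TokenRingSketch {

    // Maps the token that ends a node's range to the node name, sorted by token.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(String node, BigInteger token) {
        ring.put(token, node);
    }

    // Assumption: the token is the absolute value of the MD5 hash of the row key.
    static BigInteger tokenOf(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Selects the node with the smallest token >= keyToken.
    // Keys beyond the highest token wrap around to the first node of the ring.
    String firstReplica(BigInteger keyToken) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("the ring has no nodes");
        }
        Map.Entry<BigInteger, String> owner = ring.ceilingEntry(keyToken);
        if (owner == null) {
            owner = ring.firstEntry();
        }
        return owner.getValue();
    }
}
```

With a ring populated through addNode(…), a call such as ring.firstReplica(TokenRingSketch.tokenOf("26813445")) returns the node that would hold the first replica under these assumptions.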
For instance, the token of the Pavillon Nation row key 26813445 is 91851936251452796391746312281860607309. This token is within the token range of node 172.39.126.86, which means that this node is responsible for storing a replica of the Pavillon Nation record.

In most cases a replica is stored by more than one node, depending on the keyspace's replication factor. For instance, a replication factor of 2 means that the clockwise next node of the ring stores a replica, too. If the replication factor is 3, the node after that also stores a replica, and so forth.

Processing a read request

To handle a read request, the StorageProxy (which is the coordinator of the request) determines the replica nodes as described above. Additionally, the StorageProxy checks that enough replica nodes are alive to handle the read request. If this is the case, the replica nodes are sorted by proximity (closest node first) and the first replica node is called to get the requested row data.

In contrast to the Thrift-based client-server communication, the Cassandra nodes exchange data using a message-oriented, TCP-based protocol. This means the StorageProxy gets the requested row data through Cassandra's messaging protocol.

Whether other replica nodes are called depends on the consistency level, which is specified by the client request. If consistency level ONE is required, no further replica nodes are called. If consistency level QUORUM is required, (replication_factor / 2) + 1 replica nodes are called in total.

In contrast to the first full-data read call, all additional calls are digest calls. A digest call queries a single MD5 hash of all column names, values and timestamps instead of requesting the complete row data. The hashes of all calls, including the first one, are compared. If a hash does not match, the replicas are inconsistent and the out-of-date replicas are auto-repaired during the read process. To do this, a full-data read request is sent to the additional nodes, the most recent version of the data is computed and the diff is sent to the out-of-date replicas.

Occasionally all replica nodes for the row key are called, independent of the requested consistency level. This depends on the column family's read_repair_chance parameter. This configuration parameter specifies the probability with which read repairs are invoked. The default value of 0.1 means that a read repair is performed for 10% of the read requests. However, the client response is always answered according to the requested consistency level; the additional work is done in the background. A read_repair_chance larger than 0 ensures that frequently read data remains consistent even though only consistency level ONE is required. The row becomes consistent eventually.

Performing the local data query

As mentioned above, a dedicated messaging protocol is used for inter-node communication. Like the CassandraServer, the MessagingService is started during the start sequence of a Cassandra node. By default the MessagingService is bound to server port 7000.

The replica node receives the read call from the coordinator node through the replica node's MessagingService. However, the MessagingService does not access the local store directly. To read and write data locally, the ColumnFamilyStore has to be used. Roughly speaking, the ColumnFamilyStore represents the underlying local store of a particular column family.

Please note that a coordinator node can also act as a replica node. This is the case if the client calls node [...] to get the Mister bed city row instead of the Pavillon Nation row in the example above. In this case the StorageProxy of the coordinator node does not call the MessagingService of the same node. To avoid remote calls to the same node, the StorageProxy calls the ColumnFamilyStore to access local data, in the same way the MessagingService does.

When processing a query, the ColumnFamilyStore tries to read the requested row data through the row cache, if the row cache is activated for the column family. The row cache holds the entire row and is deactivated by default. If the row cache contains the requested row data, no disk IO is required. The query is served very fast by performing in-memory operations only. However, with an activated row cache the full row has to be fetched internally even if only a subset of columns is requested. For this reason the row cache is often less efficient for large rows and small subset queries.

If the requested row isn't cached, the Memtables and the SSTables (sorted strings tables) have to be read. Memtables and SSTables are maintained per column family. SSTables are data files which contain row data fragments and only allow appending data. A Memtable is an in-memory table which buffers writes. If the Memtable is full, it is written to disk as a new SSTable file in the background. For this reason the columns of the requested Pavillon Nation row could be fragmented over several SSTables and unflushed Memtables. For instance, one SSTable book_a_hotel-hotel-he-1-Data.db could contain the initially inserted columns 'name'='Pavillon Nation' and 'category'='4' of the Pavillon Nation row. Another SSTable book_a_hotel-hotel-he-2-Data.db (or a Memtable) could contain the updated category column 'category'='5'.

If an SSTable exists for the requested column family, the associated (key-scoped) Bloom filter of the SSTable file is read first to avoid unnecessary disk IO. For each SSTable the ColumnFamilyStore holds an in-memory structure called SSTableReader, which contains metadata as well as the Bloom filter of the underlying SSTable file. The Bloom filter indicates whether the SSTable could contain a fragment of the row data (false positives are possible, false negatives are not). If so, the key cache is queried for the seek position. If the position is not found there, the on-disk index has to be scanned; in this case the fetched seek position is added to the key cache. Based on the seek position the row data fragment is read from the SSTable file. The data fragments of the SSTables and Memtables are merged using the column timestamps, and the requested row data is returned to the caller.
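The final merge step can be sketched as follows. This is only an illustration of merging row fragments by column timestamp, not Cassandra's code: the class RowMergeSketch and its Column type are hypothetical, each fragment is modelled as a plain map from column name to column, and on equal timestamps the sketch simply keeps the column it saw first.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of merging row fragments; not part of Cassandra.
final class RowMergeSketch {

    // A column value together with the timestamp of the write that produced it.
    static final class Column {
        final String value;
        final long timestamp;

        Column(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Merges the fragments of one row (from SSTables and Memtables).
    // For each column name the column with the highest timestamp wins.
    static Map<String, Column> merge(List<Map<String, Column>> fragments) {
        Map<String, Column> merged = new TreeMap<>();
        for (Map<String, Column> fragment : fragments) {
            for (Map.Entry<String, Column> entry : fragment.entrySet()) {
                merged.merge(entry.getKey(), entry.getValue(),
                        (current, candidate) ->
                                candidate.timestamp > current.timestamp ? candidate : current);
            }
        }
        return merged;
    }
}
```

Applied to the example above, merging a fragment holding 'name'='Pavillon Nation' and 'category'='4' with a later fragment holding 'category'='5' yields a row with the name column and the category '5', regardless of the order in which the fragments are read.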
Processing a write request

To insert, update or delete a row, Cassandra's mutate method has to be called. The listing below shows such a mutate call using the Hector client.

    //...
    // [1.b] create and perform an update
    Mutator<String> mutator = HFactory.createMutator(keyspaceOperator, AsciiSerializer.get());
    mutator.addInsertion("26813445", "hotel",
            HFactory.createColumn("category", "5", StringSerializer.get(), StringSerializer.get()));
    MutationResult result = mutator.execute();
    //...

The write path is very similar to the read path. Like a read request, a write request includes the required consistency level. However, the coordinator node tries to send a write request including the mutated columns to all replica nodes for the row key.

First, the StorageProxy of the coordinator node checks whether enough replica nodes for the row key are alive with regard to the requested consistency level. If this is the case, the write request is sent to the living replica nodes. If not, an error response is returned. Write requests to temporarily failed replica nodes are scheduled as hinted handoffs. This means that a hint is written locally instead of calling the failed node. Once the failed replica node is back, the hint is sent to this node to perform the write operation. By sending the hints, the failed node becomes consistent with the other nodes. Please note that hints are no longer stored locally if the failed node has been dead for longer than 1 hour (config parameter max_hint_window_in_ms).

The coordinator node returns the response to the client as soon as the number of replica nodes required by the consistency level have confirmed the update (a hinted write does not count towards the requested consistency level). The updates of the other replica nodes are still executed in the background. If an error occurs while updating the replica nodes required by the consistency level, an error response is returned. However, in this case the already updated nodes are not reverted. Cassandra does not support distributed transactions, and hence it does not support a distributed rollback.

The write operation supports an additional consistency level ANY, which means that the mutated columns have to be written to at least one node, regardless of whether this node is a replica node for the key or not. In contrast to consistency level ONE, the write also succeeds if only a hinted handoff is written (by the coordinator node). However, in this case the mutated columns are not readable until the responsible replica nodes have recovered.
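The coordinator's decision can be summarized in a short sketch. It is a simplified illustration under stated assumptions, not Cassandra's StorageProxy: the class WriteCoordinatorSketch, its WritePlan type and the use of IllegalStateException as the error response are hypothetical, and only the consistency levels discussed in this article are modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the coordinator's write planning; not part of Cassandra.
final class WriteCoordinatorSketch {

    enum ConsistencyLevel { ANY, ONE, QUORUM }

    // Which replicas get the write directly and which get a hint stored for later.
    static final class WritePlan {
        final List<String> directWrites = new ArrayList<>();
        final List<String> hints = new ArrayList<>();
    }

    // Number of confirmations needed before the client is answered.
    static int requiredAcks(ConsistencyLevel level, int replicationFactor) {
        switch (level) {
            case ANY:
            case ONE:
                return 1;
            case QUORUM:
                return (replicationFactor / 2) + 1;
            default:
                throw new IllegalArgumentException("unsupported level: " + level);
        }
    }

    // Living replicas are written directly; failed replicas get a hinted handoff.
    // A hint only counts towards the consistency level ANY.
    static WritePlan plan(List<String> replicas, Set<String> aliveNodes, ConsistencyLevel level) {
        WritePlan plan = new WritePlan();
        for (String replica : replicas) {
            if (aliveNodes.contains(replica)) {
                plan.directWrites.add(replica);
            } else {
                plan.hints.add(replica);
            }
        }
        int achievable = plan.directWrites.size();
        if (level == ConsistencyLevel.ANY) {
            achievable += plan.hints.size();
        }
        if (achievable < requiredAcks(level, replicas.size())) {
            throw new IllegalStateException("not enough living replica nodes for " + level);
        }
        return plan;
    }
}
```

For three replicas of which two are alive, a QUORUM write is planned as two direct writes plus one hint. With no living replica at all, the same call fails for ONE but succeeds for ANY, because the hints alone satisfy that level.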
Performing the local update

Like the local data query, a local update is triggered by a message handled through the MessagingService or by the StorageProxy. However, in contrast to the read path, a commit log entry is written first for durability reasons. By default the commit log entry is written asynchronously in the background.

The mutated columns are also written into the in-memory Memtable of the column family. After the changes have been inserted, the local update is completed.

However, the memory size of a Memtable is limited. If the maximum size is exceeded, the Memtable is written to disk as a new SSTable. This is done by a background thread which periodically checks the current size of all unflushed Memtables of all column families. If a Memtable exceeds the maximum size, the background thread replaces the current Memtable with a new one. The old Memtable is marked as pending flush and is flushed by another thread. Under certain circumstances several pending Memtables for a column family can exist. After the Memtable has been written to disk, a new SSTableReader referring to the written SSTable is created and added to the ColumnFamilyStore. Once written, the SSTable file is immutable. By default the SSTable data is compressed (SnappyCompression).

Compacting

The SSTable file includes the modified columns of the row including their timestamps as well as additional row metadata. For instance, the metadata section includes a (column name-scoped) Bloom filter which is used to reduce disk IO when fetching columns by name.

To reduce fragmentation and save space, SSTable files are occasionally merged into a new SSTable file. This compaction is triggered by a background thread if the compaction threshold is exceeded. The compaction threshold can be set for each column family.
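The interplay of commit log, Memtable, flush and compaction can be illustrated with a heavily simplified in-memory sketch. It is not how Cassandra is implemented: the class LocalStoreSketch, its cell-count threshold and the composite "rowKey:column" key are hypothetical, everything runs synchronously instead of in background threads, and the SSTables are just immutable sorted maps rather than files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration of the local write path; not part of Cassandra.
final class LocalStoreSketch {

    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final int maxMemtableCells; // hypothetical flush threshold
    private final List<String> commitLog = new ArrayList<>();
    private TreeMap<String, Cell> memtable = new TreeMap<>();
    private final List<SortedMap<String, Cell>> sstables = new ArrayList<>();

    LocalStoreSketch(int maxMemtableCells) {
        this.maxMemtableCells = maxMemtableCells;
    }

    // Local update: commit log entry first, then the Memtable.
    void write(String rowKey, String column, String value, long timestamp) {
        String cellKey = rowKey + ":" + column;
        commitLog.add(cellKey + "=" + value + "@" + timestamp);
        memtable.merge(cellKey, new Cell(value, timestamp), LocalStoreSketch::newer);
        if (memtable.size() >= maxMemtableCells) {
            flush();
        }
    }

    // Replaces the current Memtable and keeps the old one as an immutable SSTable.
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(Collections.unmodifiableSortedMap(memtable));
        memtable = new TreeMap<>();
    }

    // Compaction: merges all SSTables into one, keeping the newest cell per key.
    void compact() {
        if (sstables.size() < 2) {
            return;
        }
        TreeMap<String, Cell> merged = new TreeMap<>();
        for (SortedMap<String, Cell> sstable : sstables) {
            for (Map.Entry<String, Cell> entry : sstable.entrySet()) {
                merged.merge(entry.getKey(), entry.getValue(), LocalStoreSketch::newer);
            }
        }
        sstables.clear();
        sstables.add(Collections.unmodifiableSortedMap(merged));
    }

    // Reads the newest value across the Memtable and all SSTables, or null.
    String read(String rowKey, String column) {
        String cellKey = rowKey + ":" + column;
        Cell newest = memtable.get(cellKey);
        for (SortedMap<String, Cell> sstable : sstables) {
            Cell candidate = sstable.get(cellKey);
            if (candidate != null) {
                newest = (newest == null) ? candidate : newer(newest, candidate);
            }
        }
        return newest == null ? null : newest.value;
    }

    static Cell newer(Cell current, Cell candidate) {
        return candidate.timestamp > current.timestamp ? candidate : current;
    }

    int sstableCount() {
        return sstables.size();
    }

    int commitLogSize() {
        return commitLog.size();
    }
}
```

In this sketch, compaction reduces the number of SSTables that a read has to consult while leaving the visible data unchanged, which is the effect described above.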
About the author

Gregor Roth works as a software architect at United Internet group, a leading European Internet Service Provider to which GMX, 1&1, and [...] belong. His areas of interest include software and system architecture, enterprise architecture management, distributed computing, and development methodologies.