A request skew aware heterogeneous distributed storage system based on Cassandra

Zhen Ye, Shanping Li
Department of Computer Science and Technology
Zhejiang University
Hangzhou, China
{yezhen, shan}@cs.zju.edu.cn


Abstract—Many distributed storage systems have been proposed to provide high scalability and high availability for modern web applications. However, most of these systems are only aware of data skew, while request skew is also widespread and needs to be handled as well. In this paper, we present a request skew aware heterogeneous distributed storage system based on Cassandra, a well-known NoSQL database that manages very large amounts of data without a single point of failure. We improve Cassandra in two ways: 1) we minimize the forward request load by dynamically shifting the node the client application connects to onto the node that can handle the largest number of skewed requests locally; 2) when balancing the data load among the nodes of the cluster, we take their storage capacities into consideration. Our experiments show that approach 1) reduces forward read requests by about 25% and forward write requests by about 15%, and that approach 2) makes the storage utilization of the nodes noticeably more balanced.

Keywords: Distributed Storage System; NoSQL Database; Request Skew; Heterogeneous Environment

I. INTRODUCTION

Modern web applications often have to deal with very large data sets. For most of these applications, the key requirements are high scalability, high availability and the ability to respond quickly even under hundreds of thousands of concurrent requests. Compared with these, strong data consistency and strict transaction support such as ACID can sometimes be weakened or even dropped. Traditional relational databases are therefore not well suited to serving such applications: most of them are transaction based and are hard to scale to very large sizes, or can do so only at very high cost; in addition, relational databases provide a large feature set much of which is never used, which only adds cost and complexity [6, 7].

In the past few years, many scalable NoSQL distributed storage systems have consequently been proposed, e.g. Google's Bigtable [2], Amazon's Dynamo [3] and Yahoo!'s PNUTS [4]. According to the CAP theorem [8], a distributed system cannot achieve high availability while still maintaining strong consistency. For this reason, most of these systems relax strong consistency and strict transactions in exchange for scalability and availability, so that they can scale out dynamically to Internet size, remain resilient to node and network failures, and serve massive access well. Data partitioning and data replication are the two techniques these systems typically use to achieve these goals: partitioning improves scalability and performance, while replication provides high availability and helps balance the load.

Most Internet-scale applications exhibit highly skewed workloads, including data skew and request skew, which the system should spread evenly over its nodes. However, data skew and request skew can conflict: sometimes the data is already distributed evenly while the request load is still skewed towards one or a few nodes, or vice versa. Although most systems claim to be aware of these skews, many of them only care about balancing the data across nodes, and balancing both request skew and data skew remains a challenging issue. In addition, when balancing the data, many systems assume all nodes are identical, whereas in a real distributed environment different nodes may have different capacities, which should be taken into consideration when designing the system.

In this paper, we present a request skew aware heterogeneous distributed storage system based on Cassandra [5]. Cassandra is a NoSQL database for managing very large amounts of data spread across many commodity servers while providing a highly available service with no single point of failure. However, Cassandra only offers a data skew aware solution and assumes there is no request skew; it also assumes that all nodes have the same storage capacity. We improve Cassandra in two ways: 1) we minimize the forward request load by dynamically changing the node the client application is connected to, moving it to the node that can handle the largest number of skewed requests locally; 2) when balancing the storage load among the nodes of the cluster, we take their storage capacities into account to maximize utilization.

The remainder of the paper is organized as follows. Section II introduces related work. Section III presents the background of Cassandra and our improvements to it. Section IV describes our experiments. Finally, Section V presents conclusions and future work.

II. RELATED WORK

Bigtable, Dynamo and PNUTS are all widely studied and cited in the distributed storage system domain.
Bigtable is implemented as a sparse, multidimensional sorted map, which is richer than the key-value data model while still simple enough. It uses the Google File System (GFS) [1] to store data and log information. GFS divides files into fixed-size chunks and balances the data load by distributing those chunks evenly across nodes. However, GFS uses a primary-copy, pessimistic algorithm to synchronize data between replica nodes, which makes it scale and perform poorly in write-intensive and wide-area scenarios.

Dynamo is a highly available, eventually consistent key-value storage system that uses consistent hashing to increase scalability, vector clocks for reconciliation, and sloppy quorum with hinted handoff to handle temporary failures. In Dynamo, both the nodes and the keys of data items are hashed, and the hash values are mapped onto a "ring". A node's hash determines its position on the ring, while a key's hash decides on which node the item is stored. To balance the data load, Dynamo maps one physical node to many virtual nodes, each occupying one position on the ring; different nodes may own different numbers of virtual nodes according to their capacity.

PNUTS, a geographically distributed database system, uses pub/sub messaging to guarantee the order of updates to a key and provides per-record timeline consistency, which lies between strict serializability and eventual consistency. Similar to Bigtable, PNUTS uses a centralized router to look up the right node for a given key and divides data into many fixed-size tablets. It can move tablets from overloaded nodes to lightly loaded nodes to balance the data load, and it dynamically changes the node a client connects to in order to reduce forwarded requests.

ecStore [9] is a cloud-based elastic storage system that supports automatic data partitioning and replication, load balancing, efficient range queries and transactional access. It uses a stratum architecture: BATON-tree-based data partitioning as the bottom layer to provide high scalability, two-tier load-adaptive replication as the middle layer to balance load and provide high availability, and multi-version optimistic concurrency control as the top transaction layer to provide data consistency. It uses data partitioning to address data skew, and addresses request skew by adding secondary replicas for hot-spot data. ecStore uses primary-copy optimistic replication and provides adaptive read consistency. However, if the read consistency value is set equal to the number of replicas, meaning the client is not acknowledged until the data has been propagated to all replicas, the result is the same as primary-copy pessimistic replication and may cause poor performance; otherwise, stale data may be returned when a recent update has not yet been synchronized to the nodes being accessed.

S. Bianchi et al. [10] study the load of a P2P system under biased request workloads. They observe that such systems carry a heavy lookup traffic load, including load on the intermediate nodes responsible for forwarding accesses to the target node. Based on this, the authors propose routing table reorganization to reduce the forward request load, and caches and data replicas to reduce the local request load, so as to balance the overall traffic. Because hundreds of nodes would be needed, the authors evaluated their approach only by simulation.

M. Abdallah et al. [11] propose a load balancing mechanism that takes into account both data popularity and node heterogeneity. They define the load measure as a function of the number of queries issued on data items per time frame, and propose a mechanism that balances the system's load by adjusting the DHT structure so that it best captures the query load distribution and node heterogeneity. However, this system does not consider data replication, and it assumes that a forwarded request and its response message carry the same load, which may not be correct.

III. SYSTEM DESIGN

Cassandra is inspired by Bigtable and Dynamo: it integrates Bigtable's column-family-based data model with Dynamo's eventual consistency behavior, and thus inherits the advantages of both.

As in Dynamo, Cassandra uses consistent hashing to partition and distribute data across nodes by hashing both the nodes and the data keys onto a "ring". To improve availability and balance the load, each data item is replicated to N nodes, where N is a replication factor that can be configured in advance. Cassandra first assigns each data item a Coordinator Node, which is the first node the item meets when walking clockwise around the ring, and then replicates the item to the next N-1 clockwise successor nodes on the ring. To trade off between strong consistency and high performance, Cassandra provides different consistency level options for both read and write operations. For writes, Consistency.One means the operation is routed only to the closest replica node; Consistency.Quorum means the system routes the request to a quorum, usually N/2+1, of nodes and waits for their responses; Consistency.All means the request is routed to all N replica nodes and the system waits for all of their responses. For reads, the operation is routed to all replicas, but the system waits only for a specific number of responses; the remaining responses are received and handled asynchronously. For Consistency.One this number is 1, for Consistency.Quorum it is N/2+1, and for Consistency.All it is N.

In the rest of this section we describe how we improve Cassandra.

A. Minimize forward request load

To use a Cassandra cluster, the client application needs to connect to one node within the cluster; we call this node the Connected Node. When the client reads or writes a data item, if the Connected Node is not one of the replica nodes responsible for that item, it has to forward the request to one or more other nodes, wait for their responses and finally reply to the client. Since most applications exhibit request skew, the total forward request load differs depending on which node is chosen as the Connected Node. However, the client usually does not know in advance which node would incur the least forward request load. Moreover, even if the client initially connects to the best node, the users' access pattern changes over time and so does the hot-spot data. For these two reasons, we need to change this choice dynamically to minimize the forward request load.
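To make this concrete, the following sketch estimates, for each candidate Connected Node, how many requests it could serve without forwarding under a skewed per-key workload; the node that covers the hot keys is the best choice. This is illustrative Python only, not code from Cassandra or from our modification, and the names (local_load_per_node, request_counts, the example replica map) are invented for the example.

    from collections import Counter

    def local_load_per_node(request_counts, replicas_of, nodes):
        """For each candidate Connected Node, count the requests it could
        answer locally (without forwarding), given per-key request counts
        and a mapping from key to its replica set."""
        local = Counter({n: 0 for n in nodes})
        for key, count in request_counts.items():
            for node in replicas_of(key):
                local[node] += count  # this node holds a replica, so no forwarding needed
        return local

    # A skewed (hot-key) workload on a hypothetical 3-node cluster.
    nodes = ["n1", "n2", "n3"]
    replica_map = {"hot": {"n2", "n3"}, "warm": {"n1", "n2"}, "cold": {"n1", "n3"}}
    workload = {"hot": 900, "warm": 80, "cold": 20}

    local = local_load_per_node(workload, replica_map.get, nodes)
    print(local.most_common(1))  # [('n2', 980)]: connecting to n2 minimizes forwarding

The weighted bookkeeping of Fig. 1 and the switch rule of Fig. 2, described below, perform essentially this estimation online, with per-kind weights that reflect how much a change of Connected Node helps each request.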
In Cassandra, a request can give rise to three kinds of forward request:

•  K1: The request's consistency level is Consistency.One. In this case the Connected Node only forwards the request to the closest replica and waits for its response.

•  K2: The request is a read and its consistency level is not Consistency.One. The Connected Node forwards two types of request: a read request routed to the closest replica, which returns the whole message to the Connected Node, and read digest requests sent to the other replicas, which return only message digests. After collecting the responses, the system digests the full message and compares it with the digests received from the other replicas to check that they are the same version.

•  K3: The request is a write and its consistency level is not Consistency.One. The Connected Node forwards the write and waits for blockNumber write responses according to the consistency level, where blockNumber is N/2+1 for Consistency.Quorum and N for Consistency.All.

When accessing a data item, the benefit of shifting the Connected Node from the original node to one of the replicas responsible for that item differs for each kind of forward request. For K1, if the Connected Node is one of the replicas, it does not need to wait for any message from remote nodes, which improves response time considerably. For K2, it still has to wait for read digest messages from the other replicas, but it can handle the read request itself. For K3, it still has to wait for (blockNumber-1) write responses from the other replicas, so compared with the other two cases the improvement is limited.

Based on this observation, we design our improvement as follows. The Connected Node records the request load of every node, assigning different kinds of request different weights. At a fixed interval, the system compares the node with the maximum request load against the current Connected Node; if its load is much larger than that of the current Connected Node, the Connected Node is changed to that node. Fig. 1 and Fig. 2 describe the pseudocode.

          // Executed on the Connected Node for every request
     1: nodes ← findNodes(key)
     2: for node ∈ nodes do
     3:     load ← baseLoad                                   // default load (K3: forwarded write)
     4:     if blockNumber = 1 then                           // K1: Consistency.One
     5:         load ← baseLoad * weightOne
     6:     end if
     7:     if blockNumber > 1 and isReadOperation() then     // K2: read above Consistency.One
     8:         load ← baseLoad * weightRead
     9:     end if
    10:     addLoad(node, load)
    11: end for

          Figure 1. Record each node's request load

     1: maxNode ← maxLoadNode()
     2: if getLoad(maxNode) - getLoad(connectedNode) > changeFactor * totalClusterLoad() then
     3:     changeConnectedNode(maxNode)
     4: end if

          Figure 2. Change the Connected Node

B. Consider node storage capacity when balancing data among nodes

To balance the data load among nodes, Cassandra monitors each node's data load on the ring. If a node is found to be overloaded, the system alleviates its load by moving the node's position along the ring. The detailed checking and moving algorithms are described in [12].

However, Cassandra assumes that every node has the same storage capacity: it only monitors the storage size each node has used and uses this information to judge whether a node is overloaded. In reality, the commodity server nodes within one Cassandra cluster may have different storage capacities, which also needs to be taken into consideration when balancing the data.

To maximize each node's storage utilization, we propose an enhanced data balancing algorithm. We compare each node's local storage used ratio with the cluster's average used ratio; if it is larger than moveRatio (a configurable factor) times the average, we say this node is overloaded. When a node is found to be overloaded, we select as candidates the nodes with enough free space that, after the overloaded node's position is moved, both they and the overloaded node would keep their used ratio below moveRatio times the average. If there is more than one candidate, we select the one with the minimum original used ratio as the target node and move the overloaded node next to it to balance the data between them.

Since our balancing algorithm is based on the used ratio, if two nodes' total storage capacities differ greatly, their used storage sizes will also differ greatly even when their used ratios are similar. This means one node stores far more data than the other, which often also means it receives a much larger request load. We address this potential issue with a variable called allowCapacityRatio: for any node whose total storage is larger than allowCapacityRatio times the smallest node's total storage, we use this maximum allowed capacity as its total capacity instead.

     1: localUsedRatio ← localUsedSize / localTotalSize
     2: averageUsedRatio ← getClusterAverageUsedRatio()
     3: if localUsedRatio > moveRatio * averageUsedRatio then
     4:     candidateNodes ← all nodes whose (usedSize + localUsedSize) / (totalSize + localTotalSize) < moveRatio * averageUsedRatio
     5:     targetNode ← minUsedRatio(candidateNodes)
     6:     move the local node so that newLocalUsedRatio = newTargetNodeUsedRatio
     7: end if

          Figure 3. Storage balance algorithm
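As a concrete reading of Fig. 3, the sketch below applies the capacity-aware check, including the allowCapacityRatio cap, to decide whether a node is overloaded and which candidate it should be moved next to. It is illustrative Python, not the actual modification to Cassandra; in particular, computing the cluster average used ratio as total used size over total effective capacity is our own simplification.

    def effective_capacity(node, nodes, allow_capacity_ratio=2.0):
        """Cap a node's capacity at allowCapacityRatio times the smallest node's capacity."""
        min_total = min(n["total"] for n in nodes)
        return min(node["total"], allow_capacity_ratio * min_total)

    def balance_step(local, nodes, move_ratio=1.5, allow_capacity_ratio=2.0):
        """One evaluation of the rule in Fig. 3 for node `local`.

        Each node is a dict {"name": ..., "used": ..., "total": ...} (sizes in one unit).
        Returns the target node to move next to, or None if `local` is not overloaded
        or no candidate has enough free space.
        """
        cap = {n["name"]: effective_capacity(n, nodes, allow_capacity_ratio) for n in nodes}
        avg_used_ratio = sum(n["used"] for n in nodes) / sum(cap.values())  # simplifying assumption
        threshold = move_ratio * avg_used_ratio
        if local["used"] / cap[local["name"]] <= threshold:
            return None                                    # local node is not overloaded
        candidates = [
            n for n in nodes
            if n["name"] != local["name"]
            and (n["used"] + local["used"]) / (cap[n["name"]] + cap[local["name"]]) < threshold
        ]
        if not candidates:
            return None
        # Among the candidates, pick the one with the smallest original used ratio.
        return min(candidates, key=lambda n: n["used"] / cap[n["name"]])

    # Example: a small, nearly full node next to larger, emptier ones.
    cluster = [{"name": "a", "used": 90,  "total": 100},
               {"name": "b", "used": 120, "total": 500},
               {"name": "c", "used": 60,  "total": 200}]
    print(balance_step(cluster[0], cluster)["name"])  # "c": lowest used ratio among candidates

Running this check periodically on every node approximates the behavior described above; the actual move of the node's ring position (line 6 of Fig. 3) is left out of the sketch.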
IV. EXPERIMENT

We run a series of experiments to evaluate our approach. The base Cassandra version we use is 0.6.4. We set up a cluster of 6 commodity server nodes; all nodes are within the same LAN and the same rack.

We made some changes to the TPC-W benchmark application and use it to drive our system. The requests in TPC-W follow a Zipf distribution, which is widely used in the web application domain to simulate users' real access patterns.

A. The result of minimizing the forward request load

The criterion we use here is the total number of forward requests at each node. We assign the following values to the variables described in Fig. 1: baseLoad = 1, weightOne = 2, weightRead = 1.2. For Fig. 2, we set changeFactor = 5% and let the system check once a day whether the Connected Node needs to be changed.

To see how the replication factor and different consistency levels affect the result, we set up three rounds of tests; each round runs for 24 hours, and the request numbers in the following tables are in units of ten thousand.

1) Replication Factor = 3, W = 1, R = 1

In this round, each data item has 3 replicas distributed over 3 different nodes, and all operations use consistency level Consistency.One, which means a write touches only one node and a read waits for only one response while the other responses arrive asynchronously.

We connect the client to Node3 at first. Since its Factor Diff is 1.8%, which is smaller than changeFactor, the Connected Node does not change after running the algorithm; however, from Table I we can see that if we had connected to Node1 or Node2 at first, the Connected Node would have been changed to Node4. In this round there are no K2 or K3 requests since all operations use Consistency.One. Factor Diff is defined as:

    Factor Diff_i = (load_max - load_i) / load_total

Table II shows that if the Connected Node is Node1 at first, changing it reduces the forward read requests by 34.8% and the forward write requests by 31%. For Node2 the values are 29.4% and 25.7% respectively. In this round there are no read digest requests that have to be forwarded synchronously.

TABLE I. REQUEST LOAD DIFFERENCE IN ROUND 1

              Node1    Node2    Node3    Node4    Node5    Node6
K1 Request    232      256      315      349      323      266
Total Load    464      512      630      698      646      532
Factor Diff   6.4%     5.0%     1.8%     0%       1.4%     4.5%

TABLE II. FORWARD REQUEST REDUCED RATIO IN ROUND 1

         Forward        Forward         Read Reduced   Write Reduced
         Read Request   Write Request   Ratio          Ratio
Node1    236            113             34.8%          31%
Node2    218            105             29.4%          25.7%
Node4    154            78              0%             0%

2) Replication Factor = 2, W = 1, R = 1

In this round we change the replication factor to 2; the purpose is to see how the number of replicas affects our algorithm. From Table III we can see that if we connect our TPC-W client application to Node1, Node2 or Node3 at first, the Connected Node is changed to Node5 after running the algorithm. Table IV shows that shifting the Connected Node from Node1 to Node5 reduces forward read requests by 25.2% and forward write requests by 18%; from Node2 to Node5 the reductions are 22.5% and 19.2% respectively, and from Node3 to Node5 they are 17% and 11.6%.

Comparing with the results of round 1, we find that when the rest of the configuration remains the same, the fewer replicas we use, the more likely the Connected Node is to be changed, but the improvement per change is smaller than in round 1.

TABLE III. REQUEST LOAD DIFFERENCE IN ROUND 2

              Node1    Node2    Node3    Node4    Node5    Node6
K1 Request    147      156      187      230      247      195
Total Load    294      312      374      460      494      390
Factor Diff   8.6%     7.8%     5.2%     1.5%     0%       4.5%

TABLE IV. FORWARD REQUEST REDUCED RATIO IN ROUND 2

         Forward        Forward         Read Reduced   Write Reduced
         Read Request   Write Request   Ratio          Ratio
Node1    295            139             25.2%          18%
Node2    284            141             22.5%          19.2%
Node3    265            129             17.0%          11.6%
Node5    220            114             0%             0%

3) Replication Factor = 3, W = 2, R = 2

In round 3 we change both the read and the write consistency level to Consistency.Quorum to see how this affects our algorithm. Since no request uses Consistency.One, there are no K1 requests. Table V presents the detailed results, from which we can see that if the client application is first connected to Node1 or Node2, the Connected Node needs to be changed to Node4. As Table VI shows, the read reduced ratio is the same as in round 1, while the write reduced ratio is lower than in round 1. This means that for the same number of replicas, the stricter the consistency level, the smaller the improvement.

TABLE V. REQUEST LOAD DIFFERENCE IN ROUND 3

              Node1    Node2    Node3    Node4    Node5    Node6
K2 Request    154      171      213      236      218      177
K3 Request    78       84       102      113      105      89
Total Load    262.8    289.2    357.6    396.2    366.6    301.4
Factor Diff   6.8%     5.4%     2.0%     0%       1.5%     4.8%

TABLE VI. FORWARD REQUEST REDUCED RATIO IN ROUND 3

         Forward        Forward Read     Forward         Read Reduced   Write Reduced
         Read Request   Digest Request   Write Request   Ratio          Ratio
Node1    236            390              304             34.8%          11.5%
Node2    218            390              296             29.4%          9.1%
Node4    154            390              269             0%             0%
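Across the three rounds, the switch rule of Fig. 2 amounts to comparing the current Connected Node's Factor Diff with changeFactor. The sketch below (illustrative Python, fed with the Total Load row of Table I) reproduces that decision; the exact Factor Diff percentages it computes may differ slightly from those reported in Table I, which may use a different load total.

    def factor_diff(loads):
        """Factor Diff_i = (load_max - load_i) / load_total, as defined above."""
        load_max, load_total = max(loads.values()), sum(loads.values())
        return {node: (load_max - load) / load_total for node, load in loads.items()}

    def next_connected_node(connected, loads, change_factor=0.05):
        """Fig. 2: switch to the most-loaded node when the current Connected Node's
        Factor Diff exceeds changeFactor; otherwise keep the current node."""
        if factor_diff(loads)[connected] > change_factor:
            return max(loads, key=loads.get)
        return connected

    # Total Load row of Table I (round 1), in units of ten thousand requests.
    round1 = {"Node1": 464, "Node2": 512, "Node3": 630,
              "Node4": 698, "Node5": 646, "Node6": 532}

    print(next_connected_node("Node3", round1))  # Node3 stays: its Factor Diff is below 5%
    print(next_connected_node("Node1", round1))  # Node1 -> Node4, as in the round 1 discussion

The same rule, applied to the Total Load rows of Tables III and V, yields the Connected Node changes reported for rounds 2 and 3.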
B. The result of considering storage capacity

In this experiment we set moveRatio = 1.5 and allowCapacityRatio = 2.

First we run TPC-W for a long time to populate the cluster nodes with enough data. Then we run the load balance script command several times. In round one we use Cassandra's original load balance algorithm; in round two we use our algorithm. Fig. 4 and Fig. 5 present the results, where LB1 is the load balance algorithm provided by Cassandra and LB2 is our algorithm.

From Fig. 4 we can see that although LB2 does not distribute data across the nodes as evenly as LB1, it still has a significant effect compared with the original load distribution.

From Fig. 5 it is obvious that our algorithm utilizes the nodes' storage capacities much better than the original one. Since the utilization is more balanced, the whole cluster can store more data, which means the overall storage utilization is improved.

Figure 4. Storage size used by each node

Figure 5. Storage utilization of each node

V. CONCLUSIONS AND FUTURE WORK

In this paper we have presented two ways to improve Cassandra so that it is aware of request skew and takes the nodes' different capacities into account when balancing the storage load.

First, we propose an algorithm that minimizes forward requests by dynamically shifting the Connected Node to the node that can handle the largest number of requests locally.

Second, we introduce a new idea that improves each node's storage utilization by using the used ratio instead of the used size when balancing data storage.

We then ran several experiments to evaluate the effectiveness of our approach. The results show that in the different scenarios we can substantially reduce both forward read requests and forward write requests, and that storage utilization becomes noticeably more balanced and improved.

For now we assume that all nodes are within the same datacenter; in the future we will extend our research to multiple datacenters. Also, all data currently has the same number of replicas; as a next step we will consider adding additional adaptive replicas for nodes that contain hot-spot data.

REFERENCES

[1]  S. Ghemawat, H. Gobioff and S. Leung, "The Google File System", in 19th Symposium on Operating Systems Principles, Lake George, New York, 2003, pp. 29-43.
[2]  F. Chang et al., "Bigtable: A distributed storage system for structured data", in Proc. OSDI, 2006, pp. 205-218.
[3]  G. DeCandia et al., "Dynamo: Amazon's highly available key-value store", in Proc. SOSP, 2007, pp. 205-220.
[4]  B. F. Cooper et al., "PNUTS: Yahoo!'s hosted data serving platform", Proc. VLDB Endow., vol. 1, pp. 1277-1288, August 2008.
[5]  A. Lakshman and P. Malik, "Cassandra: A decentralized structured storage system", SIGOPS Oper. Syst. Rev., vol. 44, pp. 35-40, 2009.
[6]  M. Stonebraker, "SQL databases v. NoSQL databases", Commun. ACM, vol. 53, pp. 10-11, April 2010.
[7]  N. Leavitt, "Will NoSQL databases live up to their promise?", Computer, vol. 43, pp. 12-14, February 2010.
[8]  E. A. Brewer, "Towards robust distributed systems", in Principles of Distributed Computing, Portland, Oregon, July 2000.
[9]  H. T. Vo, C. Chen and B. C. Ooi, "Towards elastic transactional cloud storage with range query support", Proc. VLDB Endow., vol. 3, pp. 506-517, 2010.
[10] S. Bianchi, S. Serbu, P. Felber and P. Kropf, "Adaptive load balancing for DHT lookups", in Proc. ICCCN, 2006, pp. 411-418.
[11] M. Abdallah and E. Buyukkaya, "Fair load balancing under skewed popularity patterns in heterogeneous DHT-based P2P systems", in Proc. International Conference on Parallel and Distributed Computing and Systems, 2007, pp. 484-490.
[12] M. Abdallah and H. C. Le, "Scalable range query processing for large-scale distributed database applications", in Proc. Int'l Conf. Parallel and Distributed Computing Systems (PDCS), 2005.

 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Kürzlich hochgeladen (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

A request skew aware heterogeneous distributed storage system based on Cassandra

For most of these applications, the most desired characteristics are high scalability, high availability and the ability to respond quickly even under hundreds of thousands of concurrent requests. Compared with these, strong data consistency and strict transaction support, such as ACID, can sometimes be weakened or even dropped.

Obviously, a traditional relational database is not suitable for serving these kinds of applications, since most relational databases are transaction based and are hard to scale to very large size, or can only do so at very high cost; in addition, they always provide a far larger feature set than is actually used, which only adds cost and complexity [6, 7].

So in the past few years, many scalable NoSQL distributed storage systems have been proposed, e.g. Google's Bigtable [2], Amazon's Dynamo [3] and Yahoo!'s PNUTS [4]. According to the CAP theorem [8], a distributed system cannot provide high availability while still maintaining strong consistency. For this reason, most of these systems relax strong consistency and strict transactions to gain high scalability and availability; they can therefore scale out dynamically to Internet size, remain resilient to node or network failures, and serve massive access well.

However, Cassandra only provides a data skew aware solution and assumes there is no request skew. It also assumes that all nodes have the same storage capacity. We improve Cassandra in two ways: 1) minimize forward request load by dynamically changing the node the client application connects to, to the one that can handle the maximum number of skewed requests; 2) when balancing storage load among all nodes within the cluster, take their storage capacity into consideration to maximize utilization.

The remainder of the paper is organized as follows. Section II introduces related work. Section III presents the background of Cassandra and how we improve it. Section IV reports our experiments. Finally, Section V presents conclusions and future work.

II. RELATED WORK

Bigtable, Dynamo and PNUTS are all widely studied and cited in the distributed storage system domain. Bigtable is implemented as a sparse, multidimensional sorted map, which is richer than a key-value data model while still simple enough. It uses the Google File System (GFS) [1] to store data and log information. GFS divides files into fixed-size chunks and balances the data load by distributing those chunks evenly across different nodes. However, GFS uses a primary-copy, pessimistic algorithm to synchronize data between replica nodes, which makes it scale and perform poorly in write-intensive and wide-area scenarios.

Dynamo is a highly available, eventually consistent key-value storage system that uses consistent hashing to increase scalability, vector clocks for reconciliation, and sloppy quorum with hinted handoff to handle temporary failures. In Dynamo, both the nodes and the keys of data items are hashed, and the output values are mapped onto a "ring". A node's hash value determines its position on the ring, while a key's hash value decides which node the item will be stored on. To balance the data load, Dynamo maps one physical node onto many virtual nodes, each occupying one position on the ring; different nodes may have different numbers of virtual nodes, based on their capacity.

PNUTS, a geographically distributed database system, uses pub/sub messaging to order the updates of a given key and provides per-record timeline consistency, which lies between strict serialization and eventual consistency. Similar to Bigtable, PNUTS uses a centralized router to look up the right node for a given key and divides data into many fixed-size tablets. It can move tablets from overloaded nodes to lightly loaded nodes to balance the data load, and it dynamically changes the node the client connects to in order to reduce forwarded requests.

ecStore [9] is a cloud-based elastic storage system that supports automatic data partitioning and replication, load balancing, efficient range queries and transactional access. It uses a stratum architecture: BATON-tree-based data partitioning as the bottom layer to provide high scalability, two-tier load-adaptive replication as the middle layer to balance load and provide high availability, and multi-version optimistic concurrency control as the top transaction layer to provide data consistency. It uses data partitioning to address data skew and handles request skew by adding secondary replicas for hotspot data. ecStore uses primary-copy optimistic replication and provides adaptive read consistency. However, if the read consistency value is chosen equal to the number of replicas, meaning that nothing is returned to the client until the data has been updated on all replicas, the result is the same as primary-copy pessimistic replication and may perform poorly; otherwise, stale data may be returned when a recent update has not yet been synchronized to the nodes being accessed.

S. Bianchi et al. [10] study the load of a P2P system under biased request workloads. They observe that such systems carry a heavy lookup traffic load, and that much of it falls on the intermediate nodes responsible for forwarding accesses to the target node. Based on this, the authors propose routing table reorganization to reduce forward request load, and caching plus data replicas to reduce local request load and balance traffic. As the experiments require hundreds of nodes, the authors only ran simulations.

M. Abdallah et al. [11] propose a load balancing mechanism that takes into account both data popularity and node heterogeneity. It defines the load measure as a function of the number of queries issued on data items per time frame, and then proposes a mechanism that balances the system's load by adjusting the DHT structure so that it best captures query load distributions and node heterogeneity. However, this system does not consider data replication. It also assumes that the forward request and the response message carry the same load, which may not be correct.
III. SYSTEM DESIGN

Cassandra is inspired by Bigtable and Dynamo: it integrates Bigtable's column-family-based data model with Dynamo's eventual consistency behavior, and thus gets the advantages of both.

As in Dynamo, Cassandra uses consistent hashing to partition and distribute data across nodes by hashing both the nodes and each data item's key onto a "ring". To improve availability and balance the load, every data item is replicated onto N nodes, where N is a replication factor that can be configured in advance. Each data item is first assigned a Coordinator Node, which is the first node met when walking clockwise around the ring from the item's position; the item is then replicated to the next N-1 clockwise successor nodes on the ring.
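As an illustration of this placement rule, the Coordinator Node and its N-1 successors can be located with a sorted map of tokens. This is only a minimal sketch under simplifying assumptions (one token per node, long tokens, no rack awareness), not Cassandra's actual code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Minimal sketch of replica placement on a consistent-hash ring.
    class ReplicaPlacement {
        private final TreeMap<Long, String> ring = new TreeMap<>();   // token -> node

        void addNode(long token, String node) {
            ring.put(token, node);
        }

        // Coordinator = first node met walking clockwise from the key's token,
        // followed by the next n-1 successors on the ring.
        List<String> replicasFor(long keyToken, int n) {
            List<String> replicas = new ArrayList<>();
            if (ring.isEmpty()) return replicas;
            SortedMap<Long, String> tail = ring.tailMap(keyToken);
            long token = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
            while (replicas.size() < Math.min(n, ring.size())) {
                replicas.add(ring.get(token));
                Long next = ring.higherKey(token);
                token = (next == null) ? ring.firstKey() : next;   // wrap around the ring
            }
            return replicas;
        }
    }

With n equal to the replication factor N, the first node returned plays the role of the Coordinator Node and the remaining entries are its clockwise successors, matching the placement described above.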
To trade off between strong consistency and high performance, Cassandra provides different consistency level options for both read and write operations. For writes, Consistency.One means the operation is only routed to the closest replica node; Consistency.Quorum means the system routes the request to a quorum, usually N/2+1, of the replica nodes and waits for their responses; Consistency.All means the request is routed to all N replica nodes and the system waits for all of their responses. For reads, the operation is routed to all replicas, but the system only waits for a specific number of responses; the others are received and handled asynchronously. For Consistency.One this number is 1, for Consistency.Quorum it is N/2+1, and for Consistency.All it is N.
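The number of replica responses a request blocks for therefore depends only on its consistency level and on N. A minimal sketch of this mapping (illustrative names, not Cassandra's API):

    // How many replica responses a request waits for, given n replicas.
    enum ConsistencyLevel { ONE, QUORUM, ALL }

    final class BlockFor {
        static int blockNumber(ConsistencyLevel level, int n) {
            switch (level) {
                case ONE:    return 1;           // wait for the closest replica only
                case QUORUM: return n / 2 + 1;   // usually N/2 + 1
                case ALL:    return n;           // wait for every replica
                default:     throw new IllegalArgumentException("unknown level");
            }
        }
    }

For example, with N = 3, Consistency.Quorum blocks for 2 responses and Consistency.All for 3; the remaining responses are handled asynchronously, as described above.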
In this section we describe how we improve Cassandra.

A. Minimize forward request load

To use a Cassandra cluster, a client application needs to connect to one node within the cluster; we call this node the Connected Node. When the client reads or writes a data item and the Connected Node is not one of the replica nodes responsible for that item, it has to forward the request to one or more other nodes, wait for their responses, and finally reply to the client. Since most applications exhibit request skew, the total forward request load differs depending on which node is chosen as the Connected Node. However, the client usually does not know in advance which node would incur the least forward request load, and even if it initially connects to the best node, the users' access pattern changes over time and so does the hot-spot data. For these two reasons, the choice of Connected Node needs to be changed dynamically to minimize the forward request load.

In Cassandra, three kinds of forward request can occur:

• K1: The request's consistency level is Consistency.One. In this situation the Connected Node only forwards the request to the closest replica and waits for its response.

• K2: The request is a read and its consistency level is not Consistency.One. The Connected Node forwards two types of request: a read request routed to the closest replica, which returns the whole message, and read digest requests sent to the other replicas, which return only a message digest. After receiving all responses, the system digests the full message and compares it with the digests from the other replicas to check that they are the same version.

• K3: The request is a write and its consistency level is not Consistency.One. The Connected Node forwards the request and waits for blockNumber write responses according to the consistency level; blockNumber is N/2+1 for Consistency.Quorum and N for Consistency.All.

When accessing a data item, the benefit of shifting the Connected Node from the original node to one of the item's replicas differs for each kind of forward request. In K1, if the Connected Node is one of the replicas, it does not need to wait for any message from remote nodes, which improves its response time a lot. In K2, it still has to wait for read digest messages from the other replicas, but the Connected Node can handle the read request itself. In K3, it still has to wait for (blockNumber-1) write responses from other replicas, so compared with the other two situations the improvement is limited.

Based on this observation, we propose the following improvement: record every node's request load at the Connected Node, assigning different weights to the different kinds of request. At a fixed interval, the system compares the maximum recorded load with the load of the current Connected Node; if the difference is large enough, the Connected Node is changed to that node. Fig. 1 and Fig. 2 give the pseudocode.

// Executed on the Connected Node for every request
nodes ← findNodes(key)
for node ∈ nodes do
    load ← baseLoad                          // each write operation's load
    if blockNumber equals 1 then
        load ← baseLoad * weightOne
    end if
    if blockNumber greater than 1 and isReadOperation() then
        load ← baseLoad * weighRead
    end if
    addLoad(node, load)
end for

Figure 1. Record each node's request load

maxNode ← maxLoadNode()
if getLoad(maxNode) - getLoad(connectedNode) greater than changeFactor * totalClusterLoad() then
    changeConnectedNode(maxNode)
end if

Figure 2. Change the Connected Node
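A compact Java rendering of the ideas in Fig. 1 and Fig. 2 might look as follows. This is only a sketch: the replica list is assumed to come from findNodes(key), the reconnect step and the periodic timer are left abstract, and none of the names are Cassandra APIs.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Per-node request-load accounting (Fig. 1) and the periodic Connected Node
    // switch (Fig. 2), using the weights from the experiments: baseLoad = 1,
    // weightOne = 2, weighRead = 1.2, changeFactor = 5%.
    class ConnectedNodeBalancer {
        private final Map<String, Double> load = new HashMap<>();
        private final double baseLoad = 1.0, weightOne = 2.0, weighRead = 1.2;
        private final double changeFactor = 0.05;
        private String connectedNode;

        ConnectedNodeBalancer(String initialNode) { this.connectedNode = initialNode; }

        // Fig. 1: called on the Connected Node for every request.
        void record(List<String> replicaNodes, int blockNumber, boolean isRead) {
            double l = baseLoad;                               // K3: write, blockNumber > 1
            if (blockNumber == 1) l = baseLoad * weightOne;    // K1
            else if (isRead)      l = baseLoad * weighRead;    // K2
            for (String node : replicaNodes) {
                load.merge(node, l, Double::sum);
            }
        }

        // Fig. 2: run periodically (once a day in the experiments).
        void maybeChangeConnectedNode() {
            String maxNode = connectedNode;
            double maxLoad = load.getOrDefault(connectedNode, 0.0);
            double total = 0.0;
            for (Map.Entry<String, Double> e : load.entrySet()) {
                total += e.getValue();
                if (e.getValue() > maxLoad) { maxLoad = e.getValue(); maxNode = e.getKey(); }
            }
            if (maxLoad - load.getOrDefault(connectedNode, 0.0) > changeFactor * total) {
                connectedNode = maxNode;                       // reconnect the client here
            }
        }
    }

Because Fig. 1 credits a node with more weight exactly when that node could have answered the request locally, the node with the maximum recorded load is the one that would absorb the most forwarded requests if the client connected to it directly.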
B. Consider node storage capacity when balancing data among nodes

To balance the data load between nodes, Cassandra monitors each node's data load information on the ring. If one node is found to be overloaded, the system alleviates its load by moving its position along the ring; the detailed checking and moving algorithms are described in [12].

However, Cassandra assumes every node has the same storage capacity: it only monitors the storage size each node has used and then uses this information to judge whether a node is overloaded. In reality, within one Cassandra cluster, different commodity server nodes may have different storage capacities, which also needs to be taken into consideration when balancing the data.

To maximize each node's storage utilization, we propose an enhanced data balancing algorithm. We compare each node's local storage used ratio with the whole cluster's average used ratio; if the local ratio exceeds moveRatio times the average (moveRatio can be configured), we say that node is overloaded. When an overloaded node is found, we select as candidates those nodes that have enough free space to keep both their own and the overloaded node's used ratio below this threshold after the overloaded node's position is moved. If there is more than one candidate, we select the one with the minimum original used ratio as the target node and move the overloaded node beside it to balance the data between them.

Since our balancing algorithm is based on average used ratio, if two nodes' total storage capacities are very different, their used storage sizes can differ greatly even when their used ratios are similar. This means one node stores far more data than the other, which often also means it receives far more request load. We address this potential issue with a variable called allowCapacityRatio: for any node whose total storage is larger than allowCapacityRatio times the smallest node's total storage, we use this maximum allowed capacity to represent its total capacity instead.

localUsedRatio ← localUsedSize / localTotalSize
averageUsedRatio ← getClusterAverageUsedRatio()
if localUsedRatio greater than moveRatio * averageUsedRatio then
    candidateNodes ← all nodes whose (usedSize + localUsedSize) / (totalSize + localTotalSize)
                     is less than moveRatio * averageUsedRatio
    targetNode ← minUsedRatio(candidateNodes)
    move the local node so that newLocalUsedRatio equals newTargetNodeUsedRatio
end if

Figure 3. Storage balance algorithm
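One possible reading of this check in Java is sketched below; the NodeInfo fields and the cluster snapshot are assumptions made for illustration, and the actual token move is left to Cassandra's existing move operation. The cluster-average used ratio is computed here as total used size over total (capped) capacity, which is one reasonable interpretation of Fig. 3.

    import java.util.List;

    // Used-ratio based overload check (Fig. 3) with the allowCapacityRatio cap.
    // moveRatio = 1.5 and allowCapacityRatio = 2 are the values used in Section IV.
    class StorageBalancer {
        static class NodeInfo {
            final String name;
            final double usedSize;
            final double totalSize;
            NodeInfo(String name, double usedSize, double totalSize) {
                this.name = name; this.usedSize = usedSize; this.totalSize = totalSize;
            }
        }

        private final double moveRatio = 1.5;
        private final double allowCapacityRatio = 2.0;

        // Cap a node's capacity at allowCapacityRatio times the smallest node's capacity.
        private double effectiveCapacity(NodeInfo n, double minTotal) {
            return Math.min(n.totalSize, allowCapacityRatio * minTotal);
        }

        // Returns the node to move next to, or null if the local node is not overloaded.
        NodeInfo pickTarget(NodeInfo local, List<NodeInfo> cluster) {
            double minTotal = Double.MAX_VALUE, usedSum = 0, capSum = 0;
            for (NodeInfo n : cluster) minTotal = Math.min(minTotal, n.totalSize);
            for (NodeInfo n : cluster) {
                usedSum += n.usedSize;
                capSum += effectiveCapacity(n, minTotal);
            }
            double averageUsedRatio = usedSum / capSum;
            double localCap = effectiveCapacity(local, minTotal);
            if (local.usedSize / localCap <= moveRatio * averageUsedRatio) return null;

            // Candidates must stay below the threshold even after absorbing the
            // overloaded node's data; pick the one with the smallest current used ratio.
            NodeInfo target = null;
            for (NodeInfo n : cluster) {
                if (n == local) continue;
                double cap = effectiveCapacity(n, minTotal);
                double combined = (n.usedSize + local.usedSize) / (cap + localCap);
                if (combined < moveRatio * averageUsedRatio
                        && (target == null
                            || n.usedSize / cap < target.usedSize / effectiveCapacity(target, minTotal))) {
                    target = n;
                }
            }
            return target;
        }
    }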
IV. EXPERIMENT

We ran a series of experiments to evaluate the result. The base Cassandra version we use is 0.6.4. We set up a cluster of 6 commodity server nodes; all nodes are within the same LAN and the same rack.

We made some changes to the TPC-W benchmark application and used it against our system. The requests in TPC-W follow a Zipf distribution, which is widely used in the web application domain to simulate users' real access patterns.
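For reference, a Zipf-distributed key sampler of the kind used to drive such a benchmark can be sketched as follows; the exponent s and the key space are illustrative choices, not values taken from TPC-W itself.

    import java.util.Random;

    // Samples key ranks in [1, n] with probability proportional to 1 / k^s,
    // so a small set of "hot" keys receives most of the requests.
    class ZipfKeySampler {
        private final double[] cdf;
        private final Random rnd = new Random();

        ZipfKeySampler(int n, double s) {
            cdf = new double[n];
            double norm = 0;
            for (int k = 1; k <= n; k++) norm += 1.0 / Math.pow(k, s);
            double cum = 0;
            for (int k = 1; k <= n; k++) {
                cum += (1.0 / Math.pow(k, s)) / norm;
                cdf[k - 1] = cum;
            }
        }

        int nextKey() {
            double u = rnd.nextDouble();
            int lo = 0, hi = cdf.length - 1;
            while (lo < hi) {                       // binary search over the CDF
                int mid = (lo + hi) >>> 1;
                if (cdf[mid] < u) lo = mid + 1; else hi = mid;
            }
            return lo + 1;
        }
    }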
A. The result of minimizing forward request load

The criterion we use here is the total number of forwarded requests for each node. We assign the following values to the variables described in Fig. 1: baseLoad = 1, weightOne = 2, weighRead = 1.2. For Fig. 2, we set changeFactor = 5% and have the system check once a day whether the Connected Node needs to be changed.

To see how the replication factor and the consistency levels affect the result, we set up three rounds of tests. Each round runs for 24 hours, and the unit of the request numbers in the following tables is ten thousand.

1) Replication Factor = 3, W = 1, R = 1

In this round, each data item has 3 replicas distributed on 3 different nodes, and all operations use consistency level Consistency.One, which means a write touches only one node and a read waits for only one response, the other responses being received asynchronously. In this round there are no K2 and K3 requests, since all operations use Consistency.One. For each node i, Factor Diff is defined as:

Factor Diff_i = (load_max - load_i) / load_total

where load_max is the largest total load among the nodes and load_total is the total load of the whole cluster.
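Computed over the per-node loads recorded by the scheme of Fig. 1, Factor Diff can be obtained with a small helper (illustrative code, not part of the original implementation):

    import java.util.HashMap;
    import java.util.Map;

    // Factor Diff_i = (load_max - load_i) / load_total for every node i.
    final class FactorDiff {
        static Map<String, Double> compute(Map<String, Double> loadByNode) {
            double max = 0, total = 0;
            for (double l : loadByNode.values()) { max = Math.max(max, l); total += l; }
            Map<String, Double> diff = new HashMap<>();
            for (Map.Entry<String, Double> e : loadByNode.entrySet()) {
                diff.put(e.getKey(), (max - e.getValue()) / total);
            }
            return diff;
        }
    }

The Connected Node is changed only when its own Factor Diff exceeds changeFactor, which is the condition tested in Fig. 2.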
We connect the client to Node3 at first; since its Factor Diff is 1.8%, which is smaller than changeFactor, the Connected Node is not changed after running the algorithm. From Table I, however, we can see that if we connect to Node1 or Node2 at first, the Connected Node will be changed to Node4. Table II shows that if the Connected Node is Node1 at first, changing it reduces forward read requests by 34.8% and forward write requests by 31%; for Node2, the values are 29.4% and 25.7% respectively. There are no read digest requests that need to be forwarded synchronously in this round.

TABLE I. REQUEST LOAD DIFFERENCE IN ROUND 1

              Node1   Node2   Node3   Node4   Node5   Node6
K1 Request      232     256     315     349     323     266
Total Load      464     512     630     698     646     532
Factor Diff    6.4%    5.0%    1.8%      0%    1.4%    4.5%

TABLE II. FORWARD REQUEST REDUCED RATIO IN ROUND 1

         Forward Read   Forward Write   Read Reduced   Write Reduced
         request        request         Ratio          Ratio
Node1    236            113             34.8%          31%
Node2    218            105             29.4%          25.7%
Node4    154            78              0%             0%

2) Replication Factor = 2, W = 1, R = 1

In this round we change the replication factor to 2; the purpose is to see how the number of replicas affects our algorithm. From Table III we can see that if we connect our TPC-W client application to Node1, Node2 or Node3 at first, the Connected Node will be changed to Node5 after running the algorithm. Table IV shows that if the Connected Node is shifted from Node1 to Node5, forward read requests are reduced by 25.2% and forward write requests by 18%. When it is changed from Node2 to Node5, the results are 22.5% and 19.2% respectively; for Node3, they are 17% and 11.6%.

Comparing with the result of Round 1, we find that when the other configuration stays the same, the fewer replicas we use, the more likely it is that the Connected Node will be changed, but each change brings a smaller improvement than in Round 1.

TABLE III. REQUEST LOAD DIFFERENCE IN ROUND 2

              Node1   Node2   Node3   Node4   Node5   Node6
K1 Request      147     156     187     230     247     195
Total Load      294     312     374     460     494     390
Factor Diff    8.6%    7.8%    5.2%    1.5%      0%    4.5%

TABLE IV. FORWARD REQUEST REDUCED RATIO IN ROUND 2

         Forward Read   Forward Write   Read Reduced   Write Reduced
         request        request         Ratio          Ratio
Node1    295            139             25.2%          18%
Node2    284            141             22.5%          19.2%
Node3    265            129             17.0%          11.6%
Node5    220            114             0%             0%

3) Replication Factor = 3, W = 2, R = 2

In Round 3 we change both the read and the write consistency level to Consistency.Quorum to see how this affects our algorithm. Since no request uses Consistency.One, there are no K1 requests. Table V presents the detailed result, from which we can see that if the client application first connects to Node1 or Node2, the Connected Node needs to be changed to Node4. As Table VI shows, the read reduced ratio is the same as in Round 1, while the write reduced ratio is smaller than in Round 1. That means that for the same number of replicas, the stricter the consistency level, the smaller the improvement.

TABLE V. REQUEST LOAD DIFFERENCE IN ROUND 3

              Node1   Node2   Node3   Node4   Node5   Node6
K2 Request      154     171     213     236     218     177
K3 Request       78      84     102     113     105      89
Total Load    262.8   289.2   357.6   396.2   366.6   301.4
Factor Diff    6.8%    5.4%    2.0%      0%    1.5%    4.8%

TABLE VI. FORWARD REQUEST REDUCED RATIO IN ROUND 3

         Forward Read   Forward Read     Forward Write   Read Reduced   Write Reduced
         request        Digest request   request         Ratio          Ratio
Node1    236            390              304             34.8%          11.5%
Node2    218            390              296             29.4%          9.1%
Node4    154            390              269             0%             0%

B. The result of considering storage capacity

In this experiment we set moveRatio = 1.5 and allowCapacityRatio = 2.

First we run TPC-W for a long time to populate enough data into the cluster nodes. Then we run the load balance command several times. In round one we use Cassandra's original load balance algorithm; in round two we use our algorithm. Fig. 4 and Fig. 5 present the results, where LB1 is the load balance algorithm provided by Cassandra and LB2 is ours.

From Fig. 4 we can see that although LB2 does not distribute data across the nodes as evenly as LB1, compared with the original load distribution it still has a significant effect.

Figure 4. Storage size used by each node

From Fig. 5 it is obvious that our algorithm utilizes the different nodes' storage capacity much better than the original one. As the utilization is more balanced, the whole cluster can store more data, which means the storage utilization is improved.

Figure 5. Storage utilization of each node

V. CONCLUSIONS AND FUTURE WORK

In this paper we have presented two ways to improve Cassandra so that it is aware of request skew and of the different nodes' capacities when balancing the storage load.

First, we propose an algorithm that minimizes forward requests by dynamically shifting the Connected Node to the one that can handle the maximum number of requests locally.

Second, we present a new idea that improves the storage utilization of each node by using the used ratio instead of the used size to balance data storage.

We then ran several experiments to evaluate the effectiveness of our approach. The results show that in the different scenarios we can substantially reduce both forward read requests and forward write requests, and that storage utilization is balanced and improved noticeably.

For now, we assume all nodes are within the same datacenter; we will extend our research to multiple datacenters in the future. Also, all data currently has the same number of replicas; as a next step, we plan to add additional adaptive replicas for the nodes that contain hot-spot data.
REFERENCES

[1] S. Ghemawat, H. Gobioff and S. Leung, "The Google File System", in Proc. 19th Symposium on Operating Systems Principles, Lake George, New York, 2003, pp. 29-43.
[2] F. Chang et al., "Bigtable: A distributed storage system for structured data", in Proc. OSDI, 2006, pp. 205-218.
[3] G. DeCandia et al., "Dynamo: Amazon's highly available key-value store", in Proc. SOSP, 2007, pp. 205-220.
[4] B. F. Cooper et al., "PNUTS: Yahoo!'s hosted data serving platform", Proc. VLDB Endow., vol. 1, pp. 1277-1288, August 2008.
[5] A. Lakshman and P. Malik, "Cassandra: A decentralized structured storage system", SIGOPS Oper. Syst. Rev., vol. 44, pp. 35-40, 2009.
[6] M. Stonebraker, "SQL databases v. NoSQL databases", Commun. ACM, vol. 53, pp. 10-11, April 2010.
[7] N. Leavitt, "Will NoSQL databases live up to their promise?", Computer, vol. 43, pp. 12-14, February 2010.
[8] E. A. Brewer, "Towards robust distributed systems", Principles of Distributed Computing, Portland, Oregon, July 2000.
[9] H. T. Vo, C. Chen and B. C. Ooi, "Towards elastic transactional cloud storage with range query support", Proc. VLDB Endow., vol. 3, pp. 506-517, 2010.
[10] S. Bianchi, S. Serbu, P. Felber and P. Kropf, "Adaptive load balancing for DHT lookups", ICCCN, 2006, pp. 411-418.
[11] M. Abdallah and E. Buyukkaya, "Fair load balancing under skewed popularity patterns in heterogeneous DHT-based P2P systems", International Conference on Parallel and Distributed Computing and Systems, 2007, pp. 484-490.
[12] M. Abdallah and H. C. Le, "Scalable range query processing for large-scale distributed database applications", Proc. Int'l Conf. Parallel and Distributed Computing Systems (PDCS), 2005.