TSINGHUA SCIENCE AND TECHNOLOGY
ISSN 1007-0214 02/10 pp233-245
Volume 20, Number 3, June 2015
Frequency and Similarity-Aware Partitioning for Cloud Storage Based
on Space-Time Utility Maximization Model
Jianjiang Li, Jie Wu, and Zhanning Ma
Abstract: With the rise of various cloud services, the problem of redundant data is increasingly prominent in cloud storage systems. How to assign a set of documents to a distributed file system so as to reduce storage space while still ensuring access efficiency as much as possible is an urgent problem that needs to be solved.
Space-efficiency mainly uses data de-duplication technologies, while access-efficiency requires gathering the files
with high similarity on a server. Based on the study of other data de-duplication technologies, especially the
Similarity-Aware Partitioning (SAP) algorithm, this paper proposes the Frequency and Similarity-Aware Partitioning
(FSAP) algorithm for cloud storage. The FSAP algorithm is a more reasonable data partitioning algorithm than the
SAP algorithm. Meanwhile, this paper proposes the Space-Time Utility Maximization Model (STUMM), which is
useful in balancing the relationship between space-efficiency and access-efficiency. Finally, this paper uses 100
web files downloaded from CNN for testing, and the results show that, relative to using the algorithms associated
with the SAP algorithm (including the SAP-Space-Delta algorithm and the SAP-Space-Dedup algorithm), the
FSAP algorithm based on STUMM achieves a higher compression ratio and a more balanced distribution of data
blocks.
Key words: de-duplication; cloud storage; redundancy; frequency
1 Introduction
Cloud computing has become an important part of
the information technology industry in recent times.
The number of cloud storage services is increasing
with most vendors providing a variable amount
of free storage, such as Facebook, Netflix, and
Dropbox; therefore, data volumes are increasing at an
unprecedented rate. These cloud-based services must
solve two key issues: (1) They should be access-efficient to read or write the files in the system in terms of minimizing network accesses. This is a crucial factor with increasing network congestion in the underlying data center networks[1, 2]. (2) They should be space-efficient to manage the high volumes of data[3].

Jianjiang Li and Zhanning Ma are with the Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing 100083, China. E-mail: lijianjiang@ustb.edu.cn; ningzhanma@163.com.
Jie Wu is with the Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA. E-mail: jiewu@temple.edu.
To whom correspondence should be addressed.
Manuscript received: 2015-03-16; accepted: 2015-05-04
The continuous integration of global information leads to more and more high-value data. The exponential growth of high-value data poses many challenges to enterprise IT departments. The explosive data growth tends to consume a large amount of data-center storage space, forcing enterprises to buy more storage facilities. At the same time, mass data is also a considerable challenge for network transmission. Faced with this situation, we must consider other ways to cope with the burden brought by mass data.
From the economic perspective, people always want
to spend the least money to store the most data.
Even if the amount of data has not been large in
the past, researchers have already been committed to finding efficient compressed storage. Therefore, how to reduce the amount of data transferred to complete file synchronization is an issue that can bring tremendous value.
Space efficiency is often achieved by the process of de-duplication[2–6], which splits all the files in the system into blocks, and maintains only a unique copy of each block. De-duplication can improve the efficiency of storage and reduce the network traffic. Typically, most de-duplication solutions have focused on reducing the space overhead within a single server[7–9].
In an environment in which the amount of global data grows sharply, driven by the requests of business applications and the requirements of legal compliance, de-duplication technology has become companies' first choice. De-duplication technology changes the data protection method by reducing the amount of stored data, which also improves the economics of disk backup. It has gradually been recognized in industry as the next-generation backup technology, and is an essential technology in data centers.
At present, de-duplication technology has attracted the attention of researchers and industry, and has produced very fruitful research results. At the same time, many algorithms have been designed to detect and delete duplicate data in storage space or network transmission. Using data de-duplication technology to develop a safe, stable, and efficient cloud storage system has very important practical significance, in terms of saving storage space, saving network bandwidth, and even saving energy. The development of the de-duplication technique also enlarges its scope of application: it can be used not only in backup and archiving, but also in other network and desktop applications.
Reference [10] presents a Similarity-Aware Partitioning (SAP) approach that is both access-efficient and space-efficient. But there are many inaccuracies in its delta-encoding. For example, consider three non-empty files A1, A2, and A3, with A1 = B1B2B3B4B5, A2 = B6B7, and A3 = B1B2B3B6B7 (where Bi is a block). Based on the delta-encoding of the SAP algorithm, δ{A1, A2} = {B6, B7} and δ{A1, A3} = {B6, B7}; since δ{A1, A2} = δ{A1, A3} = {B6, B7}, the similarity between A1 and A3 is judged to be equal to the similarity between A1 and A2. This method is lopsided because it only considers the differences between two files, without considering the size of the files themselves. Moreover, the complexity of the SAP algorithm is very high, and the amount of data on each server might differ greatly.
In this paper, we present an efficient Frequency
and Similarity-Aware Partitioning (FSAP) algorithm
for cloud storage based on a Space-Time Utility
Maximization Model (STUMM), which is more
reasonable than the SAP algorithm. When calculating
the similarity between two files, the FSAP algorithm not
only considers the size of the redundant data between
two files, but also considers the size of the two files.
The FSAP algorithm also specifies how to remove the redundancy among different servers. With different degrees of further redundancy removal (obtained by setting different thresholds), the space used to store the files differs, and the time to access the files also differs. How to balance the relationship between space and access time to obtain the maximum benefit is a big question. In order to solve this problem, we propose the Space-Time Utility Maximization Model.
In Sections 2 and 3, we will describe the FSAP
algorithm in detail, and show the reasons why the
FSAP algorithm is more reasonable than the SAP
algorithm. In Section 4, we will present the Space-
Time Utility Maximization Model. The experimental
results are presented in Section 5. We conducted two
experiments: one for comparing the compression ratio
and accessing efficiency between the FSAP algorithm
and the SAP algorithm, and one for the reasonability
of the Space-Time Utility Maximization Model. Finally,
we will describe the related work and draw conclusions.
2 Data Partitioning Based on the Similarity
Coefficient of Files
In this section, we define the similarity coefficient between two files, compare it with the delta-encoding principle in the SAP algorithm[10], and present the process of data partitioning.
2.1 The definition of similarity coefficient between
two files
Definition 1 The similarity coefficient between any two non-empty files Ai and Aj (where i ≠ j) is
Sim(Ai, Aj) = (num(Ai ∩ Aj)/num(Ai)) × (num(Ai ∩ Aj)/num(Aj)) = (num(Ai ∩ Aj))² / (num(Ai) × num(Aj)).
Ai ∩ Aj refers to the blocks contained by both file Ai and file Aj. num(Ai ∩ Aj) represents the number of blocks contained by both file Ai and file Aj. Therefore,
num(Ai ∩ Aj) ≥ 0; num(Ai) and num(Aj) indicate the number of blocks respectively contained by the two non-empty files Ai and Aj, therefore num(Ai) ≥ 1 and num(Aj) ≥ 1.
Property 1 Sim(Ai, Aj) ∈ [0, 1].
Generally, 0 ≤ num(Ai ∩ Aj) ≤ num(Ai) and 0 ≤ num(Ai ∩ Aj) ≤ num(Aj). So, 0 ≤ Sim(Ai, Aj) ≤ 1.
If file Ai and file Aj are completely different, which means that the blocks contained by file Ai and file Aj are completely different, then num(Ai ∩ Aj) = 0. So, Sim(Ai, Aj) = 0.
If file Ai and file Aj are identical, which means that the blocks contained by file Ai and file Aj are identical, without regard to the order of blocks, then num(Ai ∩ Aj) = num(Ai) = num(Aj). So, Sim(Ai, Aj) = 1.
When num(Ai ∩ Aj) is fixed, Sim(Ai, Aj) will decline as num(Ai) or num(Aj) increases. In particular, when num(Ai) ≫ num(Ai ∩ Aj) or num(Aj) ≫ num(Ai ∩ Aj), Sim(Ai, Aj) is close to 0.
Definition 2 In particular, if exactly one of Ai and Aj (where i ≠ j) is null (i.e., it does not contain any block), then the similarity coefficient between Ai and Aj is defined as 0. If Ai and Aj (where i ≠ j) are both empty (i.e., neither Ai nor Aj contains any block), then the similarity coefficient between Ai and Aj is defined as 1.
Property 2 Sim(Ai, Aj) = Sim(Aj, Ai).
If file Ai and file Aj are two non-empty files, then according to Definition 1,
Sim(Ai, Aj) = (num(Ai ∩ Aj))² / (num(Ai) × num(Aj)),
Sim(Aj, Ai) = (num(Aj ∩ Ai))² / (num(Aj) × num(Ai)) = (num(Ai ∩ Aj))² / (num(Ai) × num(Aj)).
Therefore, Sim(Ai, Aj) = Sim(Aj, Ai).
If one of Ai and Aj is null, then according to Definition 2, Sim(Ai, Aj) = Sim(Aj, Ai) = 0.
If Ai and Aj are both empty, then according to Definition 2, Sim(Ai, Aj) = Sim(Aj, Ai) = 1.
In conclusion, Sim(Ai, Aj) = Sim(Aj, Ai).
According to Property 2, after Sim(Ai, Aj) is calculated, we know the value of Sim(Aj, Ai), and hence can cut the computational effort by half.
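As an illustration of this saving, the following sketch (reusing the sim function sketched above; all names are hypothetical) builds the full pairwise table by computing only the upper triangle and mirroring it.

```python
def similarity_matrix(files):
    """Pairwise similarity coefficients; only the upper triangle is computed."""
    n = len(files)
    m = [[1.0] * n for _ in range(n)]                     # diagonal: Sim(Ai, Ai) = 1
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = sim(files[i], files[j])   # Property 2: symmetry
    return m
```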
The similarity coefficient of two non-empty files proposed in this paper is different from the similarity of two non-empty files based on delta-encoding in the SAP algorithm[10]. For example, according to the representation method of blocks, three non-empty files A1, A2, and A3 are expressed as A1 = B1B2B3B4B5, A2 = B6B7, and A3 = B1B2B3B6B7. According to the delta-encoding in the SAP algorithm,
δ{A1, A2} = {B6, B7} and δ{A1, A3} = {B6, B7}.
That is to say, δ{A1, A2} = δ{A1, A3} = {B6, B7}; the similarity of A1 and A3 is judged to be equivalent to the similarity of A1 and A2. However, the definition of the similarity coefficient between two non-empty files proposed by this paper shows
Sim(A1, A2) = (num(A1 ∩ A2))² / (num(A1) × num(A2)) = 0,
Sim(A1, A3) = (num(A1 ∩ A3))² / (num(A1) × num(A3)) = 3² / (5 × 5) = 0.36.
That is to say, A1 and A3 are more similar than A1 and A2. From the actual situation of A1, A2, and A3: A1 and A2 have no common blocks, while A1 and A3 have three common blocks: B1, B2, and B3. It also shows that the similarity degree between A1 and A3 is much higher than the similarity degree between A1 and A2. Therefore, the similarity coefficient between two non-empty files proposed by this paper is more reasonable than the delta-encoding principle in the SAP algorithm.
Table 1 shows ten files A1, A2, ..., A10 according to the representation method of blocks, and Table 2 shows the similarity coefficients between every two files in Table 1, calculated according to the definition of the similarity coefficient proposed by this paper. Table 2 can be regarded as a symmetric matrix, where all elements on the primary diagonal are 1.
Table 1 Files A1, A2, ..., A10 according to the representation method of blocks (where Bi is a file block).
A1 = B1B2B7      A6 = B3B5B6
A2 = B3B7        A7 = B3B4B6
A3 = B2B4B1B8    A8 = B4B1B7
A4 = B5B3B5B8    A9 = B5B3B7B8
A5 = B6B8        A10 = B4B7B6
Table 2 The similarity coefficients between any two files of
Table 1.
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
A1 1 0.17 0.33 0 0 0 0 0.44 0.08 0.11
A2 0.17 1 0 0.13 0 0.17 0.17 0.17 0.5 0.17
A3 0.33 0 1 0.06 0.13 0 0.08 0.33 0.06 0.08
A4 0 0.13 0.06 1 0.13 0.67 0.08 0 0.75 0
A5 0 0 0.13 0.13 1 0.17 0.17 0 0.13 0.17
A6 0 0.17 0 0.67 0.17 1 0.44 0 0.33 0.11
A7 0 0.17 0.08 0.08 0.17 0.44 1 0.11 0.08 0.44
A8 0.44 0.17 0.33 0 0 0 0.11 1 0.08 0.44
A9 0.08 0.5 0.06 0.75 0.13 0.33 0.08 0.08 1 0.08
A10 0.11 0.17 0.08 0 0.17 0.11 0.44 0.44 0.08 1
2.2 The process of data partitioning based on
similarity coefficients
When the total size of all files and the capacity of
each server are known, we demonstrate how to evenly
partition these files among servers according to the
similarity coefficients of files. The files with high
similarity coefficients are likely to be partitioned to
the same server, which minimizes storage space when
keeping a high access efficiency. Next, we will go
over an example to demonstrate the process of data
partitioning based on similarity coefficients.
(1) According to the total size of all files and the
capacity of each server, we can calculate the number
of servers needed, which is the maximum number of
the required servers (in general, after the redundancy
is removed, the space occupied by these files is far
less than the amount of space occupied by these files
stored separately). These files can be placed according
to the calculated number of servers. After completion
of placing these files, if the amount of data placed on
the server is less than the capacity of the server, we can
fine-tune the data. For example, we can fine-tune the
data on the server which has the smallest total similarity
coefficient (or compression) to the other server. Here,
we might as well suppose that the number of servers
is 2 (when the number of servers is 1, all of the data is placed on the same server).
(2) According to the attraction (bigger similarity
coefficient) and repulsion (smaller similarity
coefficient) principle, we can choose pairs of files
which will be placed on different servers. At first,
we find two pairs of files with the largest similarity
coefficient from Table 2, and respectively place them on
server1 and server2. We choose A4 and A9 placed on
server1, because the similarity coefficient of A4 and A9
is 0.75, which is the biggest similarity coefficient. Then
we choose the pair, which has the biggest similarity
coefficient, from the rest of the pairs of files. The
similarity coefficients of A10-A7, A10-A8, A6-A7, and
A1-A8 are all 0.44. As can be seen, among these four
pairs of files, A7, A8, and A10 each appear twice, but A1 and A6 each appear only once. Therefore, we should select two files from A7, A8, and A10 to be stored on server2. The similarity coefficients between A7, A8, A10 and A9 on the other server (i.e., server1) are identical, so we might as well choose A8 and A10 to be placed on server2.
This process is shown in Fig. 1.
Fig. 1 The first step of data partitioning.
A4 ∪ A9 = {B3, B5, B7, B8} (Now the size of data stored on server1 is 4 blocks.) A8 ∪ A10 = {B1, B4, B6, B7} (Now the size of data stored on server2 is 4 blocks.)
(3) From Table 3, we find file A2 which has the
biggest similarity coefficient with the current blocks
stored on server1 (the similarity coefficient is 0.5),
so it will be placed on server1. From Table 4, we
find out files A1 and A7 have the biggest similarity
coefficient with the blocks currently stored on server2
(the similarity coefficients are all 0.33). The similarity
coefficients between files A1, A7 and the block
currently stored on server1 are all 0.08. According to the
attraction and repulsion principle, here we can choose
A1 (of course, we can also choose A7). This process is
shown in Fig. 2.
A4 ∪ A9 ∪ A2 = {B3, B5, B7, B8} (Now the size of data stored on server1 is 4 blocks.) A8 ∪ A10 ∪ A1 = {B1, B2, B4, B6, B7} (Now the size of data stored on server2 is 5 blocks.)
(4) From Table 5, we find file A3 which has the
biggest similarity coefficient with the current blocks
stored on server2 (the similarity coefficient is 0.45), so
it will be placed on server2. At the same time, from
Table 6, we find file A6 which has the biggest similarity
coefficient with the block currently stored on server1
(the similarity coefficient is 0.33), so it will be placed
on server1. This process is shown in Fig. 3.
A4 ∪ A9 ∪ A2 ∪ A6 = {B3, B5, B6, B7, B8} (Now the size of data stored on server1 is 5 blocks.) A8 ∪ A10 ∪ A1 ∪ A3 = {B1, B2, B4, B6, B7, B8} (Now the size of data stored on server2 is 6 blocks.)
Table 3 Similarity coefficients between A4 ∪ A9 on server1 and files not placed on any server.
           A1     A2    A3     A5     A6     A7
A4 ∪ A9   0.08   0.5   0.06   0.13   0.33   0.08

Table 4 Similarity coefficients between A8 ∪ A10 on server2 and files not placed on any server.
            A1     A2     A3     A5     A6     A7
A8 ∪ A10   0.33   0.13   0.25   0.13   0.08   0.33
Fig. 2 The second step of data partitioning.
Table 5 Similarity coefficients between A8 ∪ A10 ∪ A1 on server2 and files not placed on any server.
                 A3     A5    A6     A7
A8 ∪ A10 ∪ A1   0.45   0.1   0.07   0.27

Table 6 Similarity coefficients between A4 ∪ A9 ∪ A2 on server1 and files not placed on any server (after file A2 is stored on server1, the blocks on server1 do not change, so the similarity coefficients do not change).
                 A3     A5     A6     A7
A4 ∪ A9 ∪ A2    0.06   0.13   0.33   0.08
Fig. 3 The third step of data partitioning.
(5) From Table 7, we find file A5, which has the
biggest similarity coefficient with the blocks currently
stored on server1 (the similarity coefficient is 0.4), so it
will be placed on server1.
A4 ∪ A9 ∪ A2 ∪ A6 ∪ A5 = {B3, B5, B6, B7, B8} (At present, the size of the data stored on server1 is still 5 blocks.)
As can be seen from Tables 7 and 8, now the
similarity coefficient between A7 and the blocks
currently stored on server1 is larger than the similarity
coefficient between A7 and the blocks currently stored
on server2, and the current number of blocks stored
on server1 is less than the number of blocks stored
on server2, so in order to balance the distribution
of blocks on servers and follow the attraction and
repulsion principle, A7 will be also placed on server1.
This process is shown in Fig. 4.
A4 ∪ A9 ∪ A2 ∪ A6 ∪ A5 ∪ A7 = {B3, B4, B5, B6, B7, B8} (Now the size of data stored on server1 is 6 blocks.) A8 ∪ A10 ∪ A1 ∪ A3 = {B1, B2, B4, B6, B7, B8} (Now the size of data stored on server2 is 6 blocks.)
Table 7 Similarity coefficients between A4 ∪ A9 ∪ A2 ∪ A6 on server1 and files not placed on any server.
                      A5    A7
A4 ∪ A9 ∪ A2 ∪ A6    0.4   0.27

Table 8 Similarity coefficients between A8 ∪ A10 ∪ A1 ∪ A3 on server2 and files not placed on any server.
                      A5     A7
A8 ∪ A10 ∪ A1 ∪ A3   0.33   0.22
Fig. 4 The fourth step of data partitioning.
Now, all of the files have been placed on the appropriate servers based on the similarity of the files.
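The greedy step used repeatedly above, i.e., picking the unplaced file with the largest similarity coefficient to the blocks a server already holds, can be sketched as follows (a simplified illustration reusing the sim sketch from Section 2.1, not the full Algorithm 1; the names are hypothetical):

```python
def pick_next_file(server_blocks, unplaced):
    """server_blocks: set of block IDs already on the server;
    unplaced: dict mapping file name -> set of block IDs."""
    return max(unplaced, key=lambda name: sim(server_blocks, unplaced[name]))

# In step (3) of the example, server1 holds A4 and A9, i.e. {B3, B5, B7, B8};
# among A1, A2, and A3, file A2 = {B3, B7} has the largest coefficient (0.5),
# matching Table 3, so it is placed on server1 next.
server1 = {"B3", "B5", "B7", "B8"}
remaining = {"A1": {"B1", "B2", "B7"},
             "A2": {"B3", "B7"},
             "A3": {"B1", "B2", "B4", "B8"}}
print(pick_next_file(server1, remaining))   # -> "A2"
```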
2.3 Redundancy removal based on the occurrence
frequency of blocks
From the process of data partitioning described above,
the redundant blocks on server1 and server2 are B4,
B6, B7, and B8. If the capacity of server1 and server2
is greater than the size of blocks needed to be placed
on them, then the redundant blocks do not need to be
removed. Otherwise, we need to decide the degree of
redundancy removal based on the capacity of server1
and server2. The most extreme case is: blocks on all
servers cannot be redundant. In the above example,
among files placed on server1, the occurrence frequency
of B4, B6, B7, and B8 is 1, 3, 2, 3, respectively; among
files placed on server2, the occurrence frequency of B4,
B6, B7, and B8 is 3, 1, 3, 1, respectively. To minimize
the amount of access to the servers which store blocks,
we can remove B4 and B7 stored on the server1, and
remove B6 and B8 stored on server2. Eventually, the
blocks placed on server1 are B3B5B6B8, and the blocks
placed on server2 are B1B2B4B7. To obtain A2, A7, or A9 stored on server1, it is required to access server2 once (the total number of accesses is 3). Similarly, to obtain A3 or A10 stored on server2, it is required to access server1 once (the total number of accesses is 2). Therefore,
redundancy removal based on the frequency of blocks
is able to realize the data storage with smaller storage
space and higher access efficiency.
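A minimal sketch of this frequency-based removal, assuming each server is given as a list of its files and each file as a set of block identifiers (the function and variable names are illustrative):

```python
from collections import Counter

def remove_cross_server_redundancy(servers):
    """Keep each shared block only on the server where it occurs in the most files."""
    freq = [Counter(b for f in files for b in f) for files in servers]  # per-server block frequency
    stored = [set(c) for c in freq]                                     # blocks currently on each server
    for block in set().union(*stored):
        holders = [i for i, s in enumerate(stored) if block in s]
        if len(holders) > 1:                                            # block is redundant across servers
            keep = max(holders, key=lambda i: freq[i][block])           # server with the highest frequency
            for i in holders:
                if i != keep:
                    stored[i].discard(block)
    return stored
```

On the example above (A4, A9, A2, A6, A5, A7 on server1 and A8, A10, A1, A3 on server2), this sketch yields B3B5B6B8 on server1 and B1B2B4B7 on server2, as described in the text.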
3 The FSAP Algorithm
In Section 2, we have explained the process of the
FSAP algorithm through an example, including how
to calculate the similarity coefficients between each
set of two files, how to balance the distribution of
blocks on servers based on the similarity of files, and
how to remove redundancy based on the occurrence
frequency of blocks. The FSAP algorithm can realize
the data storage with smaller storage space and higher
access efficiency. The key step of the FSAP algorithm
(as shown in Algorithm 1) will be introduced in detail
below.
Algorithm 1 The FSAP Algorithm
Input:
n: the total number of files which will be distributed on servers
set of files A = {A1, A2, ..., An}
C: server capacity
S(A): the total size of A
K: the maximum number of the required servers
d: remaining server capacity
N(A): the number of files in A
num(Ai): the number of blocks in file Ai
q: block size
Output:
the result of data partitioning
1: Initialize: the set of files on each server: D = ∅; K = S(A)/C; d = C
2: find the K pairs of files with the largest similarity coefficients in the set Sim(Ai, Aj), 1 ≤ i < j ≤ n
3: for all 1 ≤ m ≤ K do
4:   add Aim and Ajm into Dm: Dm ← Dm + Aim + Ajm
5:   remove Aim and Ajm from A: A ← A − Aim − Ajm; N(A) ← N(A) − 2
6: L1:
7: construct a K × N(A) matrix M
8: for all 1 ≤ x ≤ K do
9:   for all 1 ≤ y ≤ N(A) do
10:     Mxy = Sim(Dx, Ay)
11: while M is not empty do
12:   find the largest element Mx1y1 in M
13:   if C ≥ num(Dx1 ∪ Ay1) × q then
14:     remove the x1-th row and the y1-th column from M
15:     Dx1 ← Dx1 + Ay1
16:     A ← A − Ay1
17:   else
18:     remove the x1-th row from M
19: if A is not empty then
20:   goto L1
21: according to the occurrence frequency, remove redundant blocks from Dm, 1 ≤ m ≤ K
22: remove the redundant blocks among different servers based on STUMM
23: D = {D1, D2, ..., DK}
24: return D
Line 1: The result of data partitioning is initialized, that is, D = ∅. At the same time, the remaining server capacity is initialized by d = C, and the maximum number of the required servers is calculated by K = S(A)/C.
Line 2: The similarity coefficients between every two files (Sim(Ai, Aj), 1 ≤ i < j ≤ n) are calculated according to the definition proposed in this paper, and then the K pairs of files whose similarity coefficients are among the top K are found. The K pairs of files are denoted by <Aim, Ajm>, where 1 ≤ m ≤ K.
Lines 3-5: The K pairs of files are placed on K
servers separately, that is, each server stores a pair
of files, denoted by Dm. Files Aim and Ajm are
added into Dm, then they are deleted from A. After
K pairs of files are processed, the number of files
in A will be reduced correspondingly.
Line 6: L1 is a label of the goto statement.
Lines 7-10: A K × N(A) matrix M is constructed. Each row of M corresponds to the similarity coefficients between the files placed on one server and the files not placed on any server. Each element of M is calculated by Mxy = Sim(Dx, Ay), where 1 ≤ x ≤ K and 1 ≤ y ≤ N(A).
Lines 11-18: This is an iterative process while M is not empty. Only the largest element Mx1y1 in M is found in each loop; in other words, file Ay1 is the most suitable file to be placed on the x1-th server. Before these operations are executed, the algorithm needs to judge whether there is enough remaining space. On the other hand, in order to balance the amount of files on each server, only one file is placed on each server in each loop. If C ≥ num(Dx1 ∪ Ay1) × q, then the x1-th row and the y1-th column will be removed from M; at the same time, file Ay1 is added into Dx1 and deleted from A. If C < num(Dx1 ∪ Ay1) × q, then the x1-th row will be removed from M.
Lines 19-20: If the set A is not empty, lines 7-18 will be executed repeatedly.
Line 21: If the set A has become empty, redundant blocks will be removed from Dm (where 1 ≤ m ≤ K) based on the occurrence frequency of blocks. The redundant blocks with lower occurrence frequencies will be removed from the servers which store them.
Line 22: According to the capacity of each server
and the access time tolerated by users, the degree
of redundancy removal will be determined. In
Section 4, we will introduce the STUMM model
in detail.
Lines 23-24: the result of data partitioning is
outputted.
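As a rough illustration of lines 7-18 (one round of the matrix M, with the capacity check and the one-file-per-server rule), the sketch below is offered; it assumes hypothetical helper structures, reuses the sim sketch from Section 2.1, and omits the outer goto loop and the redundancy-removal steps of lines 21-22:

```python
def assign_round(servers, unplaced, C, q):
    """servers: list of (block_set, file_list) pairs; unplaced: dict file name -> block set;
    C: server capacity; q: block size. Performs one pass over the matrix M."""
    active_rows = set(range(len(servers)))          # server rows still present in M
    active_cols = set(unplaced)                     # unplaced-file columns still present in M
    while active_rows and active_cols:
        # largest remaining element M[x1][y1] = Sim(D_x1, A_y1)
        x1, y1 = max(((x, y) for x in active_rows for y in active_cols),
                     key=lambda xy: sim(servers[xy[0]][0], unplaced[xy[1]]))
        blocks, files = servers[x1]
        if C >= len(blocks | unplaced[y1]) * q:     # enough remaining capacity on server x1
            blocks |= unplaced[y1]                  # place file y1 on server x1
            files.append(y1)
            del unplaced[y1]
            active_rows.discard(x1)                 # only one file per server per round
            active_cols.discard(y1)
        else:
            active_rows.discard(x1)                 # server x1 cannot take this file; drop its row
```

The outer loop of Algorithm 1 (label L1) would simply call such a round repeatedly, rebuilding M, until no unplaced files remain.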
The complexity of the FSAP algorithm proposed in this paper is lower than that of the SAP algorithm. In the first step, calculating the similarity coefficients between every two non-empty files involves a large amount of computation, and the memory usage is also large. However, since the adjacency matrix in the FSAP algorithm is a symmetric matrix, the amount of calculation and the occupancy of memory are reduced by half.
The amount of files on each server is nearly the same after data partitioning: in order to balance the amount of files stored on each server, only one file, which has the highest similarity, is placed on each server in each loop. Therefore, the amount of files on each server is nearly the same. This data partitioning is more reasonable than that of the SAP algorithm.
4 Space-Time Utility Maximization Model
Definition 3 Space-time utility P is a weighted average of data space usage and data access time. Namely,
P = α/S + (1 − α)/T, 0 ≤ α ≤ 1,
where S is data space usage, which means the total
amount of data space usage on servers. T is data access
time, which indicates the average of data access time
to obtain all data files stored on servers. The SAP
algorithm is just a special case (without consideration
of removing the redundant data blocks among different
storage servers) of the FSAP algorithm proposed in this
paper. In addition, the SAP algorithm only considers
data space usage S. However, the FSAP algorithm not
only considers data space usage S, but also considers
data access time T .
When α = 1, space-time utility P is completely controlled by data space usage S. On the other hand, when α = 0, space-time utility P is completely controlled by data access time T. This paper takes into account the effect of both data space usage and data access time on space-time utility. Therefore, the range of α should be 0 ≤ α ≤ 1. If 0 ≤ α < 0.5, then data access time influences space-time utility more than data space usage does. If 0.5 < α ≤ 1, then data space usage influences space-time utility more than data access time does. If α = 0.5, then data space usage influences space-time utility as much as data access time does. This paper will discuss under which circumstances the value of space-time utility P is maximized.
Generally, if all data can be obtained locally, data access time T is a constant T0 which does not change with data space usage S. Otherwise, when data space usage S becomes greater, data access time T will decrease due to the lower probability of data access across servers; when data space usage S becomes smaller, data access time T will increase because of the higher probability of data access across servers. Data access time T is thus a function of data space usage S; here, suppose that the function is T = β/S, where β is a constant greater than 0.
(1) If all data can be obtained locally, then Pmax = max{α/S + (1 − α)/T0}. In this case, as data space usage S increases, data access time T remains the constant T0 and does not decrease. Therefore, if all data can be obtained locally, Pmax = α/min S + (1 − α)/T0. That is, space-time utility P reaches its maximum value when the redundant blocks within each server are completely eliminated. As we know, this is guaranteed after data partitioning using the FSAP algorithm proposed in this paper. Therefore, Pmax = α/S + (1 − α)/T0.
(2) If some data may be obtained remotely, then Pmax = max{α/S + (1 − α)/(β/S)} = max{α/S + ((1 − α)/β) × S} and ∂P/∂S = −α/S² + (1 − α)/β. Therefore, when ∂P/∂S = 0, namely −α/S² + (1 − α)/β = 0, that is, when S = √(α × β/(1 − α)), the space-time utility P reaches the extreme value α/√(α × β/(1 − α)) + ((1 − α)/β) × √(α × β/(1 − α)). S can be determined according to different values of α and β.
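For case (2), the stationary point can be computed directly; the small sketch below assumes 0 < α < 1 and β > 0 and simply evaluates the closed-form expressions derived above (the function name is illustrative):

```python
import math

def stationary_space_usage(alpha, beta):
    """S at which dP/dS = 0 for P(S) = alpha/S + (1 - alpha) * S / beta, and P at that S."""
    s_star = math.sqrt(alpha * beta / (1.0 - alpha))          # S = sqrt(alpha*beta / (1 - alpha))
    p_star = alpha / s_star + (1.0 - alpha) * s_star / beta   # extreme value of P
    return s_star, p_star
```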
In order to obtain the maximum value of P , both S
and T need to be converted into a unified dimension.
The greater the S, the more storage cost paid by the data
center; the larger the T , the higher the latency which can
be tolerated by the data center.
When the size of files assigned to each server is
much less than the capacity of each server Sfile, the
redundancy may be allowed, i.e., different servers can
have the same blocks. When the size of files on each
server is close to or exceeds the capacity of each
server Sfile, the redundant blocks among servers must
be removed according to the occurrence frequency of
blocks.
While redundancy removal technology improves the
utilization of space, it reduces the reliability of data.
Researchers have done experiments to analyze how much redundancy removal contributes to reducing the reliability of data[11].
The results of these experiments show that the
availability of files is far less than the availability of data
blocks. For example, when the equipment failure rate
reaches 6%, 99.5% of data blocks are available, but only
96% of files are accessible. The reason for the results
is that multiple files share common blocks, so there is a
dependency between different files.
The influence of redundancy removal on the
reliability of data has long attracted the attention of
scholars. The simplest way to improve the reliability
of data is to create a copy using the Mirror technology,
but it will greatly reduce the utilization rate of storage
space, like Ceph[12]
and GPFS[13]
. All of these systems
rarely take into account the utilization rate of storage
space and the reliability of data.
For cloud storage systems, in order to increase
the availability of data, we should also keep some
redundant data. How much redundant data should be
kept is a difficult problem. Using the Space-Time
utility maximization model, according to the capacity
of each server and the access time tolerated by users,
we can determine the degree of redundancy removal
and make a more reasonable operation to obtain the
maximum value of P . In Section 5, we will verify it
by experiments.
5 Performance Evaluation and Analysis
In this section, we first compare the compression ratio and the distribution of data blocks between the SAP and FSAP algorithms by experiment; second, we verify the reasonability of the Space-Time utility maximization model.
5.1 Comparison of the utilization rate of storage
space of the FSAP algorithm with that of the
SAP algorithm
In this part, when the number of servers is max{2, the number of files/20} (each server is configured with a dual-core CPU and 4 GB of memory), we compare the utilization rate of storage space of the FSAP1 algorithm, the FSAP2 algorithm, the SAP-Space-Delta algorithm, and the SAP-Space-Dedup algorithm. The FSAP1 algorithm is normal
compression, and the FSAP2 algorithm is deep
compression whose threshold value is 2; namely, if the
frequency of the occurrence of redundant data blocks on
one server is smaller than 2, these redundant blocks on
this server will be removed. For example, the frequency
of the occurrence of data block B1 on server1 is 1, and
the frequency of the occurrence of B1 on server2 is 2.
In the case of the FSAP1 algorithm, the redundant data
blocks of B1 on both server1 and server2 will not be
removed, but in the case of the FSAP2 algorithm, B1 on
server1 will be removed, while B1 on server2 will be
retained.
We perform basic experiments on a set of files: 100
random web files downloaded from CNN. In the course
of these experiments, we keep the block-size constant at
48, change the number of files in increments of 5, and
measure the compression ratio achieved by the FSAP1
algorithm, the FSAP2 algorithm, the SAP-Space-
Delta algorithm, and the SAP-Space-Dedup algorithm
relative to the total size of the uncompressed files. The
test results are shown in Table 9 and Fig. 5. On the other
hand, 100 files are stored on 5 servers and Table 10
shows the distribution of data blocks using the FSAP1
algorithm, the FSAP2 algorithm, the SAP-Space-Delta
algorithm, and the SAP-Space-Dedup algorithm.
For files on CNN, the FSAP1 algorithm achieves a compression ratio of 23%, the FSAP2 algorithm achieves a compression ratio of 29.35%, the SAP-Space-Delta algorithm achieves a compression ratio of 11.65%, and the SAP-Space-Dedup algorithm achieves a compression ratio of 22.44%.
Table 9 The test results of the FSAP algorithm and the SAP algorithm.
                              Compression ratio (%)
Number of files   SAP-Space-Delta   SAP-Space-Dedup   FSAP1   FSAP2
        5              11.14             11.74        12.60   19.97
       10              11.42             14.79        14.12   20.60
       15              11.80             18.04        18.15   24.05
       20              12.49             19.77        19.59   24.95
       25              12.00             20.77        20.30   25.52
       30              11.26             21.04        20.71   25.73
       35              12.81             23.36        22.56   27.32
       40              13.71             24.96        25.25   29.67
       45              11.60             24.51        25.46   29.74
       50              11.80             25.53        26.38   30.62
       55              12.72             26.61        26.73   30.89
       60              12.52             23.45        23.37   28.71
       65              12.38             24.18        23.94   29.27
       70              12.16             24.51        24.38   29.56
       75              11.93             24.76        24.72   29.72
       80              11.83             22.71        23.32   29.18
       85              11.63             23.02        23.50   29.30
       90              11.52             23.38        23.84   29.64
       95              11.86             24.2         24.26   29.98
      100              11.65             22.44        23.00   29.35
Fig. 5 The compression ratio of files from CNN using
different algorithms.
Table 10 The distribution of data blocks using different
algorithms.
server1 server2 server3 server4 server5
SAP-Space-Delta 7536 9988 11 848 14 362 20 721
SAP-Space-Dedup 6843 9055 10 521 12 560 17 606
FSAP1 10 582 11 267 12 635 11 122 10 567
FSAP2 9196 10 012 11 138 9565 9180
The compression ratio of the CNN files using the FSAP1 algorithm is obviously higher than the compression ratio using the SAP-Space-Delta algorithm, and it is slightly higher than that using the SAP-Space-Dedup algorithm. Because each of the test web files downloaded from CNN is approximately the same size, the curves of the FSAP1 algorithm and the SAP-Space-Dedup algorithm are very similar. Even in this case, the compression ratio of the FSAP1 algorithm is still higher than the compression ratio of the SAP-Space-Dedup algorithm.
From the test results, we can find that (1) in the
case of a fixed number of servers, as the number
of files increases, the compression ratio gradually
increases; (2) if the number of servers increases, then
the compression ratio will decrease. As an analogy to
the actual situation, if the capacity of each server is
sufficient, then more files will be stored, and a higher
compression ratio will be achieved. Conversely, if the
capacity of a server is insufficient, then we need to
increase the number of servers, and the compression
ratio will decline.
From Table 10, we can find that the distribution of data blocks using the SAP-Space-Delta algorithm and the SAP-Space-Dedup algorithm is seriously imbalanced: the numbers of data blocks on different servers are obviously different. But the distribution of data blocks using the FSAP1 algorithm and the FSAP2 algorithm is basically the same. The test results show that the FSAP algorithm achieves a more even distribution of data blocks than the SAP algorithm.
In general, the test results show that the FSAP
algorithm proposed in this paper is better than the
SAP algorithm. Further, the compression ratio using
the FSAP2 algorithm is far greater than that using
the FSAP1 algorithm, which also shows that the
FSAP algorithm has a significant effect on redundancy
removal of data blocks among different servers.
5.2 The test of Space-Time utility maximization
model
In order to test the Space-Time utility maximization model, we perform experiments on a set of 100 random web files downloaded from CNN using the FSAP algorithm. The number of servers is fixed at 5, and each server is configured with a dual-core CPU and 4 GB of memory. We define a variable, the threshold. For the redundant data blocks, if the occurrence frequency on one server is smaller than the threshold value, these redundant blocks on this server will be removed (under the premise of retaining at least one copy). If the frequency on each server is larger than the threshold, these redundant blocks will not be removed; when the frequency is equal to the threshold, we only retain the redundant blocks on a single server. As the maximum occurrence frequency is 7 or 8[10], we take the maximum threshold value to be 10. We observe how S and T change with the threshold value of redundancy removal. The changes of S and T will directly affect P.
The test procedure is as follows:
(1) To obtain the change curves of S over the
threshold value of redundancy removal. S is the total
number of data blocks retained on 5 storage servers.
(2) To obtain the change curves of T over the
threshold value of redundancy removal. T is expressed
by the results of numerical simulation. Its calculation
standard is as follows: if one file is completely stored
in one server, its access time is the number of data
blocks in this file. After deep redundancy removal of
data blocks, when one data block is removed, T will be
increased by 1 because of the need to obtain the data
block from other servers.
(3) The dimensions of S and T are different; we need
to unify them. It is an important step in the test of the
Space-Time utility maximization model. The method
adopted in this paper is as follows:
S′ = (1/S − min(1/S)) / (max(1/S) − min(1/S)),
T′ = (1/T − min(1/T)) / (max(1/T) − min(1/T)).
(4) To obtain the change curves of P over the threshold value of redundancy removal. Now,
P = α × S′ + (1 − α) × T′, 0 ≤ α ≤ 1.
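A small sketch of steps (3) and (4), taking the measured S and T series over the thresholds and returning the corresponding P values (the names are illustrative; with the Table 11 series for N = 5 and α = 1 it reproduces the last column of Table 12):

```python
def space_time_utility(S_values, T_values, alpha):
    """S_values, T_values: space usage and access time measured at each threshold."""
    def normalized_reciprocal(xs):
        inv = [1.0 / x for x in xs]
        lo, hi = min(inv), max(inv)
        return [(v - lo) / (hi - lo) for v in inv]
    S_prime = normalized_reciprocal(S_values)   # S' = (1/S - min(1/S)) / (max(1/S) - min(1/S))
    T_prime = normalized_reciprocal(T_values)   # T' = (1/T - min(1/T)) / (max(1/T) - min(1/T))
    return [alpha * sp + (1 - alpha) * tp for sp, tp in zip(S_prime, T_prime)]
```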
The test results are shown in Table 11 and Figs. 6 and 7. From the test results, we can find that as the threshold value of redundancy removal increases, data space usage S decreases; meanwhile, data access time T increases. When the threshold value of redundancy removal is 2, 3, and 4, data space usage S reduces significantly and data access time T increases obviously. Finally, data space usage and data access time tend to constant values, which means that at this point there are no redundant data blocks among servers. At the same time, Table 12 and Fig. 8 show that P always reaches its extremum within the range of thresholds. When α = 0, P gets the maximum value at the minimum threshold; when α = 1, P gets the maximum value at the maximum threshold. When α changes, P gets the maximum value at different thresholds. The curves of P show that the space-time utility maximization model is reasonable. Additionally, using the Space-Time maximization model proposed in this paper, by observing the change of P with the threshold value of redundancy removal, we can achieve the best balance between data space usage and data access time.
6 Related Work
Data de-duplication, as a compression method, has been widely used in most backup systems to improve bandwidth and space efficiency[14]. This technology can remove redundant data in the data stream. The finer the grain size, the more redundant data is removed, but the more computing resources are used at the same time.
De-duplication technology has already received a lot of
attention in the industry.
Single Instance Storage (SIS) on Microsoft Windows Server 2000 ensures that identical files are stored only once, providing copies of the file through links in the file system[15]. This method is simple, easy to use, and more suitable for collections of small files, but it cannot detect identical data blocks within different files.
Table 11 The test results of S and T (N is the number of servers).
              N = 2              N = 3              N = 4              N = 5
Threshold    S        T         S        T         S        T         S        T
    1      49 996   49 996    52 910   52 910    54 832   54 832    56 059   56 059
    2      45 783   54 209    46 177   59 643    46 353   63 311    46 419   65 699
    3      45 023   55 729    45 118   61 761    45 108   65 801    45 101   68 335
    4      44 710   56 668    44 751   62 862    44 751   66 872    44 727   69 457
    5      44 577   57 200    44 599   63 470    44 573   67 584    44 548   70 173
    6      44 504   57 565    44 517   63 880    44 499   67 954    44 481   70 508
    7      44 461   57 823    44 469   64 168    44 448   68 260    44 434   70 790
    8      44 429   58 047    44 438   64 385    44 420   68 456    44 406   70 986
    9      44 413   58 175    44 414   64 577    44 403   68 592    44 384   71 162
   10      44 399   58 301    44 401   64 694    44 383   68 772    44 366   71 324
Fig. 6 The relationship between S and the threshold value
of redundancy removal. N is the number of servers.
Fig. 7 The relationship between T and the threshold value
of redundancy removal. N is the number of servers.
Table 12 The value of P for different α.
Threshold     α=0        α=0.2      α=0.5      α=0.6      α=0.8      α=1
    1        1          0.8        0.5        0.4        0.2        0
    2        0.314 42   0.409 13   0.551 19   0.598 55   0.693 25   0.787 96
    3        0.160 63   0.312 88   0.541 25   0.617 37   0.769 62   0.921 87
    4        0.098 71   0.271 23   0.530 01   0.616 27   0.788 79   0.961 31
    5        0.060 24   0.244 27   0.520 32   0.612 34   0.796 38   0.980 41
    6        0.042 50   0.231 52   0.515 05   0.609 56   0.798 58   0.987 61
    7        0.027 70   0.220 70   0.510 18   0.606 68   0.799 67   0.992 66
    8        0.017 49   0.213 13   0.506 58   0.604 40   0.800 04   0.995 68
    9        0.008 36   0.206 30   0.503 20   0.602 18   0.800 12   0.998 06
   10        0          0.2        0.5        0.6        0.8        1
Fig. 8 The relationship between P and the threshold value
of redundancy removal.
In order to delete duplicate data at the data block level, the fixed-size block method was first designed[16]. In this method, each file is divided into non-overlapping blocks of fixed size, and each data block has a hash value. Bell Labs' Venti[8] file storage system saves approximately 30% of the storage space using this method. IBM applied this technology to the online TotalStorage SAN File System[16]. In addition, virtual machine migration also uses the fixed-length block division method to improve migration performance and reduce the network cost[17]. This algorithm is very simple and efficient, but it is very sensitive to changes in file positions: inserting a new byte at the file header can cause all the data blocks to become new blocks.
The Low-Bandwidth File System (LBFS)[18] uses the Rabin fingerprint[19, 20] instead of a fixed size to divide files into variable-size blocks. This method is called content-based chunking; it makes the storage system no longer sensitive to position modifications. The value-based network cache system, the DDFS of Data Domain, the P2P file system Pasta, and the backup system Pastiche also used this technology to identify and eliminate redundant data. However, this algorithm has uncertainty in the definition of data block boundaries, and has a higher overhead. Besides these algorithms, some techniques, such as delta-encoding[21], Bloom filters[22], and sparse indexing[23], were also used to enhance the performance of de-duplication technology.
7 Conclusions
In this paper, we discuss the problem of redundant data
in the cloud storage systems and study the main de-
duplication technologies. The main goal of this paper is
how to use less space to store more data files, and at the
same time, not obviously affect access efficiency. Based
on the study of other data de-duplication technologies
(especially the SAP algorithm), we first propose a more
reasonable data partitioning algorithm — FSAP, then
present the Space-Time maximization model; finally,
we compare the utilization rate of storage space of
the FSAP algorithm with that of the SAP algorithm,
and test the Space-Time utility maximization model
through experiments on a set of 100 random web files
downloaded from CNN.
The FSAP algorithm overcomes the shortcoming of the SAP algorithm that its calculation of the similarity between two files is not reasonable, and also reduces the computational complexity and the space occupied.
The experimental results on files from CNN show that
relative to using the SAP-Space-Delta algorithm and
the SAP-Space-Dedup algorithm, the compression ratio
using the FSAP algorithm respectively increased by
17.7% and 6.91%. At the same time, the test results
show that the FSAP algorithm achieves a more even
distribution of data blocks than the SAP algorithm.
These facts show that the FSAP algorithm proposed
in this paper is much more efficient than the SAP
algorithm.
Data de-duplication does not mean that there is no
data redundancy; proper data redundancy can reduce
access time and increase the reliability of data access.
Using the space-time maximization model proposed
in this paper, by observing the change of P with the
threshold value of redundancy removal, we can achieve
the best balance between data space usage and data
access time.
Acknowledgements
This work was supported by the National High-Tech
Research and Development (863) Program of China (No.
2015AA01A303).
References
[1] T. Benson, A. Akella, and D. A. Maltz, Network traffic
characteristics of data centers in the wild, in Proceedings
of the 10th ACM SIGCOMM Conference on Internet
Measurement, ACM, 2010, pp. 267–280.
[2] D. T. Meyer and W. J. Bolosky, A study of practical
deduplication, ACM Transactions on Storage (TOS), vol.
7, no. 4, p. 14, 2012.
[3] A. T. Clements, I. Ahmad, M. Vilayannur, and J. Li,
Decentralized deduplication in san cluster file systems, in
USENIX Annual Technical Conference, 2009, pp. 101–114.
[4] U. Manber, Finding similar files in a large file system, in
Usenix Winter 1994 Technical Conference, 1994, pp. 1–10.
[5] B. S. Baker, On finding duplication and near-duplication
in large software systems, in Reverse Engineering, 1995,
Proceedings of 2nd Working Conference on, IEEE, 1995,
pp. 86–95.
[6] G. Forman, K. Eshghi, and S. Chiocchetti, Finding similar
files in large document repositories, in Proceedings of
the Eleventh ACM SIGKDD International Conference on
Knowledge Discovery in Data Mining, ACM, 2005, pp.
394–400.
[7] J. R. Douceur, A. Adya, W. J. Bolosky, P. Simon, and
M. Theimer, Reclaiming space from duplicate files in a
serverless distributed file system, in Distributed Computing
Systems, 2002. Proceedings. 22nd International
Conference on, IEEE, 2002, pp. 617–624.
[8] S. Quinlan and S. Dorward, Venti: A new approach to
archival storage, FAST, vol. 2, pp. 89–101, 2002.
[9] B. Zhu, K. Li, and R. H. Patterson, Avoiding the disk
bottleneck in the data domain deduplication file system,
FAST, vol. 8, pp. 1–14, 2008.
[10] B. Balasubramanian, T. Lan, and M. Chiang, SAP: Similarity-aware partitioning for efficient cloud storage, in IEEE INFOCOM 2014 Proceedings, 2014, pp. 592–600.
[11] D. Bhagwat, K. Pollack, D. D. Long, T. Schwarz, E.
L. Miller, and J. F. Paris, Providing high reliability
in a minimum redundancy archival storage system, in
Modeling, Analysis, and Simulation of Computer and
Telecommunication Systems, 2006. MASCOTS 2006. 14th
IEEE International Symposium on, IEEE, 2006, pp. 413–
421.
[12] T. J. Schwarz, Q. Xin, E. L. Miller, D. D. Long, A.
Hospodor, and S. Ng, Disk scrubbing in large archival
storage systems, in Modeling, Analysis, and Simulation
of Computer and Telecommunications Systems, 2004.
(MASCOTS 2004). Proceedings. The IEEE Computer
Society's 12th Annual International Symposium on, IEEE,
2004, pp. 409–418.
[13] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C.
Maltzahn, Ceph: A scalable, high-performance distributed
file system, in Proceedings of the 7th Symposium on
Operating Systems Design and Implementation, USENIX
Association, 2006, pp. 307–320.
[14] R. Zhu, L.-H. Qin, J.-L. Zhou, and H. Zheng, Using
multithreads to hide deduplication i/o latency with low
synchronization overhead, Journal of Central South
University, vol. 20, pp. 1582–1591, 2013.
[15] W. J. Bolosky, S. Corbin, D. Goebel, and J. R. Douceur,
Single instance storage in windows 2000, in Proceedings
of the 4th USENIX Windows Systems Symposium, Seattle,
USA, 2000, pp. 13–24.
[16] B. Hong, D. Plantenberg, D. D. Long, and M. Sivan-
Zimet, Duplicate data elimination in a san file system,
in Proceedings of the 21st IEEE / 12th NASA Goddard
Conference on Mass Storage Systems and Technologies,
2004, pp. 301–314.
[17] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M.
S. Lam, and M. Rosenblum, Optimizing the migration
of virtual computers, ACM SIGOPS Operating Systems
Review, vol. 36, no. SI, pp. 377–390, 2002.
[18] A. Muthitacharoen, B. Chen, and D. Mazieres, A low
bandwidth network file system, ACM SIGOPS Operating
Systems Review, vol. 35, no. 5, pp. 174–187, 2001.
[19] M. O. Rabin, Fingerprinting by random polynomials,
Center for Research in Computing Technology, Harvard
University, Report TR-15-81, 1981.
[20] A. Z. Broder, Some applications of Rabin's fingerprinting
method, in Sequences II. Springer, 1993, pp. 143–152.
[21] F. Douglis and A. Iyengar, Application-specific
delta-encoding via resemblance detection, in USENIX
Annual Technical Conference, General Track, 2003, pp.
113–126.
[22] B. H. Bloom, Space/time trade-offs in hash coding with
allowable errors, Communications of the ACM, vol. 13, no.
7, pp. 422–426, 1970.
[23] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G.
Trezis, and P. Camble, Sparse indexing: Large scale, inline
deduplication using sampling and locality, in 7th USENIX
Conference on File and Storage Technologies, 2009, pp.
111–123.
Jie Wu is the chair and a Laura H.
Carnell Professor in the Department of
Computer and Information Sciences at
Temple University. Prior to joining Temple
University, USA, he was a program
director at the National Science Foundation
and a distinguished professor at Florida
Atlantic University. He received his PhD
degree from Florida Atlantic University in 1989. His current
research interests include mobile computing and wireless
networks, routing protocols, cloud and green computing,
network trust and security, and social network applications.
Dr. Wu has regularly published in scholarly journals, conference
proceedings, and books. He serves on several editorial boards,
including IEEE Transactions on Computers, IEEE Transactions
on Service Computing, and Journal of Parallel and Distributed
Computing. Dr. Wu was general co-chair/chair for IEEE MASS
2006 and IEEE IPDPS 2008 and program co-chair for IEEE
INFOCOM 2011. Currently, he is serving as general chair
for IEEE ICDCS 2013 and ACM MobiHoc 2014, and program
chair for CCF CNCC 2013. He was an IEEE Computer Society
Distinguished Visitor, ACM Distinguished Speaker, and chair
for the IEEE Technical Committee on Distributed Processing
(TCDP). Dr. Wu is a CCF Distinguished Speaker and a Fellow
of the IEEE. He is the recipient of the 2011 China Computer
Federation (CCF) Overseas Outstanding Achievement Award.
Jianjiang Li is currently an associate
professor at University of Science and
Technology Beijing, China. He received
his PhD degree in computer science from
Tsinghua University in 2005. He was a
visiting scholar at Temple University from
Jan. 2014 to Jan. 2015. His current research
interests include parallel computing, cloud
computing, and parallel compilation.
Zhanning Ma is currently a master's student at the University of Science and Technology Beijing. His
research interests include distributed
storage technology and data duplicate
removal. He received his bachelor’s degree
in 2010 from Taishan Medical University.
 
Cloak-Reduce Load Balancing Strategy for Mapreduce
Cloak-Reduce Load Balancing Strategy for MapreduceCloak-Reduce Load Balancing Strategy for Mapreduce
Cloak-Reduce Load Balancing Strategy for Mapreduce
 
A0360109
A0360109A0360109
A0360109
 
Iaetsd secured and efficient data scheduling of intermediate data sets
Iaetsd secured and efficient data scheduling of intermediate data setsIaetsd secured and efficient data scheduling of intermediate data sets
Iaetsd secured and efficient data scheduling of intermediate data sets
 
B0330811
B0330811B0330811
B0330811
 
High level view of cloud security
High level view of cloud securityHigh level view of cloud security
High level view of cloud security
 

Ähnlich wie Frequency and similarity aware partitioning for cloud storage based on space time utility maximization model

An Efficient and Fault Tolerant Data Replica Placement Technique for Cloud ba...
An Efficient and Fault Tolerant Data Replica Placement Technique for Cloud ba...An Efficient and Fault Tolerant Data Replica Placement Technique for Cloud ba...
An Efficient and Fault Tolerant Data Replica Placement Technique for Cloud ba...IJCSIS Research Publications
 
Efficient Record De-Duplication Identifying Using Febrl Framework
Efficient Record De-Duplication Identifying Using Febrl FrameworkEfficient Record De-Duplication Identifying Using Febrl Framework
Efficient Record De-Duplication Identifying Using Febrl FrameworkIOSR Journals
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudIJAAS Team
 
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATIONA BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATIONcscpconf
 
IEEE Parallel and distributed system 2016 Title and Abstract
IEEE Parallel and distributed system 2016 Title and AbstractIEEE Parallel and distributed system 2016 Title and Abstract
IEEE Parallel and distributed system 2016 Title and Abstracttsysglobalsolutions
 
A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...Tao Feng
 
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageI-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageredpel dot com
 
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...IRJET Journal
 
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTING
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTINGREPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTING
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTINGcsandit
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...dbpublications
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Coursejimliddle
 
Towards a new hybrid approach for building documentoriented data wareh
Towards a new hybrid approach for building documentoriented data warehTowards a new hybrid approach for building documentoriented data wareh
Towards a new hybrid approach for building documentoriented data warehIJECEIAES
 

Ähnlich wie Frequency and similarity aware partitioning for cloud storage based on space time utility maximization model (20)

ICICCE0298
ICICCE0298ICICCE0298
ICICCE0298
 
An Efficient and Fault Tolerant Data Replica Placement Technique for Cloud ba...
An Efficient and Fault Tolerant Data Replica Placement Technique for Cloud ba...An Efficient and Fault Tolerant Data Replica Placement Technique for Cloud ba...
An Efficient and Fault Tolerant Data Replica Placement Technique for Cloud ba...
 
50120130406035
5012013040603550120130406035
50120130406035
 
Efficient Record De-Duplication Identifying Using Febrl Framework
Efficient Record De-Duplication Identifying Using Febrl FrameworkEfficient Record De-Duplication Identifying Using Febrl Framework
Efficient Record De-Duplication Identifying Using Febrl Framework
 
Facade
FacadeFacade
Facade
 
SiDe Enabled Reliable Replica Optimization
SiDe Enabled Reliable Replica OptimizationSiDe Enabled Reliable Replica Optimization
SiDe Enabled Reliable Replica Optimization
 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with Cloud
 
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATIONA BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
 
IEEE Parallel and distributed system 2016 Title and Abstract
IEEE Parallel and distributed system 2016 Title and AbstractIEEE Parallel and distributed system 2016 Title and Abstract
IEEE Parallel and distributed system 2016 Title and Abstract
 
A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...A memory capacity model for high performing data-filtering applications in Sa...
A memory capacity model for high performing data-filtering applications in Sa...
 
Tombolo
TomboloTombolo
Tombolo
 
Resisting skew accumulation
Resisting skew accumulationResisting skew accumulation
Resisting skew accumulation
 
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storageI-Sieve: An inline High Performance Deduplication System Used in cloud storage
I-Sieve: An inline High Performance Deduplication System Used in cloud storage
 
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
IRJET- An Integrity Auditing &Data Dedupe withEffective Bandwidth in Cloud St...
 
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTING
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTINGREPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTING
REPLICATION STRATEGY BASED ON DATA RELATIONSHIP IN GRID COMPUTING
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Course
 
Towards a new hybrid approach for building documentoriented data wareh
Towards a new hybrid approach for building documentoriented data warehTowards a new hybrid approach for building documentoriented data wareh
Towards a new hybrid approach for building documentoriented data wareh
 
50120130405014 2-3
50120130405014 2-350120130405014 2-3
50120130405014 2-3
 

Mehr von redpel dot com

An efficient tree based self-organizing protocol for internet of things
An efficient tree based self-organizing protocol for internet of thingsAn efficient tree based self-organizing protocol for internet of things
An efficient tree based self-organizing protocol for internet of thingsredpel dot com
 
Validation of pervasive cloud task migration with colored petri net
Validation of pervasive cloud task migration with colored petri netValidation of pervasive cloud task migration with colored petri net
Validation of pervasive cloud task migration with colored petri netredpel dot com
 
Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...
Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...
Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...redpel dot com
 
Towards a virtual domain based authentication on mapreduce
Towards a virtual domain based authentication on mapreduceTowards a virtual domain based authentication on mapreduce
Towards a virtual domain based authentication on mapreduceredpel dot com
 
Protection of big data privacy
Protection of big data privacyProtection of big data privacy
Protection of big data privacyredpel dot com
 
Multiagent multiobjective interaction game system for service provisoning veh...
Multiagent multiobjective interaction game system for service provisoning veh...Multiagent multiobjective interaction game system for service provisoning veh...
Multiagent multiobjective interaction game system for service provisoning veh...redpel dot com
 
Efficient multicast delivery for data redundancy minimization over wireless d...
Efficient multicast delivery for data redundancy minimization over wireless d...Efficient multicast delivery for data redundancy minimization over wireless d...
Efficient multicast delivery for data redundancy minimization over wireless d...redpel dot com
 
Cloud assisted io t-based scada systems security- a review of the state of th...
Cloud assisted io t-based scada systems security- a review of the state of th...Cloud assisted io t-based scada systems security- a review of the state of th...
Cloud assisted io t-based scada systems security- a review of the state of th...redpel dot com
 
Bayes based arp attack detection algorithm for cloud centers
Bayes based arp attack detection algorithm for cloud centersBayes based arp attack detection algorithm for cloud centers
Bayes based arp attack detection algorithm for cloud centersredpel dot com
 
Architecture harmonization between cloud radio access network and fog network
Architecture harmonization between cloud radio access network and fog networkArchitecture harmonization between cloud radio access network and fog network
Architecture harmonization between cloud radio access network and fog networkredpel dot com
 
Analysis of classical encryption techniques in cloud computing
Analysis of classical encryption techniques in cloud computingAnalysis of classical encryption techniques in cloud computing
Analysis of classical encryption techniques in cloud computingredpel dot com
 
An anomalous behavior detection model in cloud computing
An anomalous behavior detection model in cloud computingAn anomalous behavior detection model in cloud computing
An anomalous behavior detection model in cloud computingredpel dot com
 
A parallel patient treatment time prediction algorithm and its applications i...
A parallel patient treatment time prediction algorithm and its applications i...A parallel patient treatment time prediction algorithm and its applications i...
A parallel patient treatment time prediction algorithm and its applications i...redpel dot com
 
A mobile offloading game against smart attacks
A mobile offloading game against smart attacksA mobile offloading game against smart attacks
A mobile offloading game against smart attacksredpel dot com
 
A distributed video management cloud platform using hadoop
A distributed video management cloud platform using hadoopA distributed video management cloud platform using hadoop
A distributed video management cloud platform using hadoopredpel dot com
 
A deep awareness framework for pervasiv video cloud
A deep awareness framework for pervasiv video cloudA deep awareness framework for pervasiv video cloud
A deep awareness framework for pervasiv video cloudredpel dot com
 
A cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring dataA cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring dataredpel dot com
 
A cloud gaming system based on user level virtualization and its resource sch...
A cloud gaming system based on user level virtualization and its resource sch...A cloud gaming system based on user level virtualization and its resource sch...
A cloud gaming system based on user level virtualization and its resource sch...redpel dot com
 
Software defined networking with pseudonym systems for secure vehicular clouds
Software defined networking with pseudonym systems for secure vehicular cloudsSoftware defined networking with pseudonym systems for secure vehicular clouds
Software defined networking with pseudonym systems for secure vehicular cloudsredpel dot com
 
Top k query processing and malicious node identification based on node groupi...
Top k query processing and malicious node identification based on node groupi...Top k query processing and malicious node identification based on node groupi...
Top k query processing and malicious node identification based on node groupi...redpel dot com
 

Mehr von redpel dot com (20)

An efficient tree based self-organizing protocol for internet of things
An efficient tree based self-organizing protocol for internet of thingsAn efficient tree based self-organizing protocol for internet of things
An efficient tree based self-organizing protocol for internet of things
 
Validation of pervasive cloud task migration with colored petri net
Validation of pervasive cloud task migration with colored petri netValidation of pervasive cloud task migration with colored petri net
Validation of pervasive cloud task migration with colored petri net
 
Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...
Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...
Web Service QoS Prediction Based on Adaptive Dynamic Programming Using Fuzzy ...
 
Towards a virtual domain based authentication on mapreduce
Towards a virtual domain based authentication on mapreduceTowards a virtual domain based authentication on mapreduce
Towards a virtual domain based authentication on mapreduce
 
Protection of big data privacy
Protection of big data privacyProtection of big data privacy
Protection of big data privacy
 
Multiagent multiobjective interaction game system for service provisoning veh...
Multiagent multiobjective interaction game system for service provisoning veh...Multiagent multiobjective interaction game system for service provisoning veh...
Multiagent multiobjective interaction game system for service provisoning veh...
 
Efficient multicast delivery for data redundancy minimization over wireless d...
Efficient multicast delivery for data redundancy minimization over wireless d...Efficient multicast delivery for data redundancy minimization over wireless d...
Efficient multicast delivery for data redundancy minimization over wireless d...
 
Cloud assisted io t-based scada systems security- a review of the state of th...
Cloud assisted io t-based scada systems security- a review of the state of th...Cloud assisted io t-based scada systems security- a review of the state of th...
Cloud assisted io t-based scada systems security- a review of the state of th...
 
Bayes based arp attack detection algorithm for cloud centers
Bayes based arp attack detection algorithm for cloud centersBayes based arp attack detection algorithm for cloud centers
Bayes based arp attack detection algorithm for cloud centers
 
Architecture harmonization between cloud radio access network and fog network
Architecture harmonization between cloud radio access network and fog networkArchitecture harmonization between cloud radio access network and fog network
Architecture harmonization between cloud radio access network and fog network
 
Analysis of classical encryption techniques in cloud computing
Analysis of classical encryption techniques in cloud computingAnalysis of classical encryption techniques in cloud computing
Analysis of classical encryption techniques in cloud computing
 
An anomalous behavior detection model in cloud computing
An anomalous behavior detection model in cloud computingAn anomalous behavior detection model in cloud computing
An anomalous behavior detection model in cloud computing
 
A parallel patient treatment time prediction algorithm and its applications i...
A parallel patient treatment time prediction algorithm and its applications i...A parallel patient treatment time prediction algorithm and its applications i...
A parallel patient treatment time prediction algorithm and its applications i...
 
A mobile offloading game against smart attacks
A mobile offloading game against smart attacksA mobile offloading game against smart attacks
A mobile offloading game against smart attacks
 
A distributed video management cloud platform using hadoop
A distributed video management cloud platform using hadoopA distributed video management cloud platform using hadoop
A distributed video management cloud platform using hadoop
 
A deep awareness framework for pervasiv video cloud
A deep awareness framework for pervasiv video cloudA deep awareness framework for pervasiv video cloud
A deep awareness framework for pervasiv video cloud
 
A cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring dataA cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring data
 
A cloud gaming system based on user level virtualization and its resource sch...
A cloud gaming system based on user level virtualization and its resource sch...A cloud gaming system based on user level virtualization and its resource sch...
A cloud gaming system based on user level virtualization and its resource sch...
 
Software defined networking with pseudonym systems for secure vehicular clouds
Software defined networking with pseudonym systems for secure vehicular cloudsSoftware defined networking with pseudonym systems for secure vehicular clouds
Software defined networking with pseudonym systems for secure vehicular clouds
 
Top k query processing and malicious node identification based on node groupi...
Top k query processing and malicious node identification based on node groupi...Top k query processing and malicious node identification based on node groupi...
Top k query processing and malicious node identification based on node groupi...
 

Kürzlich hochgeladen

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxRosabel UA
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operationalssuser3e220a
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 

Kürzlich hochgeladen (20)

Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 

Frequency and similarity aware partitioning for cloud storage based on space time utility maximization model

the past, researchers have already been committed to discovering efficient compression storage. Therefore, how to reduce the amount of data transferred to complete file synchronization is an issue which can bring tremendous value.

Space efficiency is often achieved by the process of de-duplication[2-6], which splits all the files in the system into blocks and maintains only a unique copy of each block. De-duplication can improve the efficiency of storage and reduce network traffic. Typically, most de-duplication solutions have focused on reducing the space overhead within a single server[7-9].

Under the environment in which the quantity of global data grows sharply, at the request of business applications and the requirements of legal compliance, de-duplication technology has become the companies' first choice. De-duplication technology changes the data protection method by reducing the amount of stored data, which also improves the economics of disk backup. It has also gradually been recognized in industry as the next-generation backup technology, an essential technology in data centers. At present, de-duplication technology has attracted the attention of researchers and industry, and has produced very fruitful research results. At the same time, many algorithms have been designed to detect and delete duplicate data in storage space or network transmission. Using data de-duplication technology to develop a safe, stable, and efficient cloud storage system has very important practical significance, in terms of saving storage space, saving network bandwidth, and even saving energy. The development of the de-duplication technique also enlarges its applicable scope; therefore, it can be used not only in backup and archiving, but also in other network and desktop applications.

Reference [10] shows a Similarity-Aware Partitioning (SAP) approach that is both access-efficient and space-efficient. However, there are many inaccuracies in its delta-encoding. For example, consider three non-empty files A1, A2, and A3 with A1 = B1B2B3B4B5, A2 = B6B7, and A3 = B1B2B3B6B7 (where Bi is a block). Based on the delta-encoding of the SAP algorithm, δ{A1, A2} = {B6, B7} and δ{A1, A3} = {B6, B7}. Since δ{A1, A2} = δ{A1, A3} = {B6, B7}, the similarity between A1 and A3 is judged to be equal to the similarity between A1 and A2. This method is lopsided because it only considers the differences between two files, without considering the size of the files themselves. Moreover, the complexity of the SAP algorithm is very high, and the amount of data on each server might differ greatly.

In this paper, we present an efficient Frequency and Similarity-Aware Partitioning (FSAP) algorithm for cloud storage based on a Space-Time Utility Maximization Model (STUMM), which is more reasonable than the SAP algorithm. When calculating the similarity between two files, the FSAP algorithm considers not only the size of the redundant data between the two files, but also the size of the two files themselves. The FSAP algorithm also presents how to remove the redundancy among different servers. With different degrees of further redundancy removal (set by different thresholds), the space used to store the files differs, and the time to access the files differs as well. How to balance the relationship between space and access time to get the maximum benefit is a big question.
In order to solve this problem, we propose a Space-Time Utility Maximization Model. In Sections 2 and 3, we describe the FSAP algorithm in detail and show why the FSAP algorithm is more reasonable than the SAP algorithm. In Section 4, we present the Space-Time Utility Maximization Model. The experimental results are presented in Section 5; we conducted two experiments, one comparing the compression ratio and access efficiency of the FSAP algorithm and the SAP algorithm, and one verifying the reasonability of the Space-Time Utility Maximization Model. Finally, we describe the related work and draw conclusions.

2 Data Partitioning Based on the Similarity Coefficient of Files

In this section, we define the similarity coefficient between two files, compare it with the delta-encoding principle in the SAP algorithm[10], and present the process of data partitioning.

2.1 The definition of similarity coefficient between two files

Definition 1 The similarity coefficient between any two non-empty files Ai and Aj (where i ≠ j) is
Sim(Ai, Aj) = (num(Ai ∩ Aj)/num(Ai)) × (num(Ai ∩ Aj)/num(Aj)) = (num(Ai ∩ Aj))² / (num(Ai) × num(Aj)).
Ai ∩ Aj refers to the blocks contained by both file Ai and file Aj, and num(Ai ∩ Aj) represents the number of blocks contained by both file Ai and file Aj. Therefore,
num(Ai ∩ Aj) ≥ 0; num(Ai) and num(Aj) indicate the number of blocks respectively contained by the two non-empty files Ai and Aj, and therefore num(Ai) ≥ 1 and num(Aj) ≥ 1.

Property 1 Sim(Ai, Aj) ∈ [0, 1].

Generally, 0 ≤ num(Ai ∩ Aj) ≤ num(Ai) and 0 ≤ num(Ai ∩ Aj) ≤ num(Aj), so 0 ≤ Sim(Ai, Aj) ≤ 1. If file Ai and file Aj are completely different, which means that the blocks contained by file Ai and file Aj are completely different, then num(Ai ∩ Aj) = 0, so Sim(Ai, Aj) = 0. If file Ai and file Aj are identical, which means that the blocks contained by file Ai and file Aj are identical without regard to the order of blocks, then num(Ai ∩ Aj) = num(Ai) = num(Aj), so Sim(Ai, Aj) = 1. When num(Ai ∩ Aj) is fixed, Sim(Ai, Aj) declines as num(Ai) or num(Aj) increases. In particular, when num(Ai) ≫ num(Ai ∩ Aj) or num(Aj) ≫ num(Ai ∩ Aj), Sim(Ai, Aj) is close to 0.

Definition 2 In particular, if exactly one of Ai and Aj (where i ≠ j) is null (i.e., it does not contain any block), then the similarity coefficient between Ai and Aj is defined as 0. If Ai and Aj (where i ≠ j) are both empty (i.e., neither Ai nor Aj contains any block), then the similarity coefficient between Ai and Aj is defined as 1.

Property 2 Sim(Ai, Aj) = Sim(Aj, Ai).

If file Ai and file Aj are two non-empty files, then according to Definition 1,
Sim(Ai, Aj) = (num(Ai ∩ Aj))² / (num(Ai) × num(Aj)), and
Sim(Aj, Ai) = (num(Aj ∩ Ai))² / (num(Aj) × num(Ai)) = (num(Ai ∩ Aj))² / (num(Ai) × num(Aj)).
Therefore, Sim(Ai, Aj) = Sim(Aj, Ai). If one of Ai and Aj is null, then according to Definition 2, Sim(Ai, Aj) = Sim(Aj, Ai) = 0. If Ai and Aj are both empty, then according to Definition 2, Sim(Ai, Aj) = Sim(Aj, Ai) = 1. In conclusion, Sim(Ai, Aj) = Sim(Aj, Ai). According to Property 2, once Sim(Ai, Aj) has been calculated, the value of Sim(Aj, Ai) is already known, which cuts the computational effort in half.

The similarity coefficient of two non-empty files proposed in this paper is different from the similarity of two non-empty files based on delta-encoding in the SAP algorithm[10]. For example, according to the representation method of blocks, three non-empty files A1, A2, and A3 are expressed as A1 = B1B2B3B4B5, A2 = B6B7, and A3 = B1B2B3B6B7. According to the delta-encoding in the SAP algorithm, δ{A1, A2} = {B6, B7} and δ{A1, A3} = {B6, B7}. That is to say, since δ{A1, A2} = δ{A1, A3} = {B6, B7}, the similarity of A1 and A3 is judged to be equivalent to the similarity of A1 and A2. However, the definition of the similarity coefficient between two non-empty files proposed in this paper gives
Sim(A1, A2) = (num(A1 ∩ A2))² / (num(A1) × num(A2)) = 0, and
Sim(A1, A3) = (num(A1 ∩ A3))² / (num(A1) × num(A3)) = 3² / (5 × 5) = 0.36.
That is to say, A1 and A3 are more similar than A1 and A2. From the actual situation of A1, A2, and A3, A1 and A2 have no common blocks, while A1 and A3 have three common blocks: B1, B2, and B3. This also shows that the similarity degree between A1 and A3 is much higher than the similarity degree between A1 and A2. Therefore, the similarity coefficient between two non-empty files proposed in this paper is more reasonable than the delta-encoding principle in the SAP algorithm.

Table 1 shows ten files A1, A2, ..., A10 according to the representation method of blocks, and Table 2 shows the similarity coefficients between every two files in Table 1, calculated according to the definition of the similarity coefficient proposed in this paper.
Table 2 can be regarded as a symmetric matrix, where all elements on the primary diagonal are 1.

Table 1 Files A1, A2, ..., A10 according to the representation method of blocks (where Bi is a file block).
A1 = B1B2B7      A6 = B3B5B6
A2 = B3B7        A7 = B3B4B6
A3 = B2B4B1B8    A8 = B4B1B7
A4 = B5B3B5B8    A9 = B5B3B7B8
A5 = B6B8        A10 = B4B7B6

Table 2 The similarity coefficients between any two files of Table 1.
      A1    A2    A3    A4    A5    A6    A7    A8    A9    A10
A1    1     0.17  0.33  0     0     0     0     0.44  0.08  0.11
A2    0.17  1     0     0.13  0     0.17  0.17  0.17  0.5   0.17
A3    0.33  0     1     0.06  0.13  0     0.08  0.33  0.06  0.08
A4    0     0.13  0.06  1     0.13  0.67  0.08  0     0.75  0
A5    0     0     0.13  0.13  1     0.17  0.17  0     0.13  0.17
A6    0     0.17  0     0.67  0.17  1     0.44  0     0.33  0.11
A7    0     0.17  0.08  0.08  0.17  0.44  1     0.11  0.08  0.44
A8    0.44  0.17  0.33  0     0     0     0.11  1     0.08  0.44
A9    0.08  0.5   0.06  0.75  0.13  0.33  0.08  0.08  1     0.08
A10   0.11  0.17  0.08  0     0.17  0.11  0.44  0.44  0.08  1
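As a concrete illustration, the following minimal Python sketch (not from the paper) implements Definitions 1 and 2, taking num(·) as the number of distinct blocks in a file, and reproduces the worked comparison with delta-encoding from Section 2.1:

```python
def sim(a, b):
    """Similarity coefficient of Definitions 1 and 2.
    a, b are iterables of block identifiers; num() is taken here as the
    number of distinct blocks in a file."""
    blocks_a, blocks_b = set(a), set(b)
    if not blocks_a and not blocks_b:   # both files empty -> 1 (Definition 2)
        return 1.0
    if not blocks_a or not blocks_b:    # exactly one file empty -> 0 (Definition 2)
        return 0.0
    shared = len(blocks_a & blocks_b)   # num(Ai ∩ Aj)
    return shared ** 2 / (len(blocks_a) * len(blocks_b))

# The example used against delta-encoding in Section 2.1:
A1 = ["B1", "B2", "B3", "B4", "B5"]
A2 = ["B6", "B7"]
A3 = ["B1", "B2", "B3", "B6", "B7"]
print(sim(A1, A2))                 # 0.0
print(sim(A1, A3))                 # 0.36
print(sim(A1, A3) == sim(A3, A1))  # True (Property 2)
```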
2.2 The process of data partitioning based on similarity coefficients

When the total size of all files and the capacity of each server are known, we demonstrate how to evenly partition these files among servers according to the similarity coefficients of the files. The files with high similarity coefficients are likely to be partitioned to the same server, which minimizes storage space while keeping a high access efficiency. Next, we go over an example to demonstrate the process of data partitioning based on similarity coefficients.

(1) According to the total size of all files and the capacity of each server, we can calculate the number of servers needed, which is the maximum number of required servers (in general, after the redundancy is removed, the space occupied by these files is far less than the amount of space occupied when these files are stored separately). These files can be placed according to the calculated number of servers. After placing these files, if the amount of data placed on a server is less than the capacity of the server, we can fine-tune the data; for example, we can fine-tune the data on the server which has the smallest total similarity coefficient (or compression) toward the other server. Here, we suppose that the number of servers is 2 (when the number of servers is 1, all of the data is placed on the same server).

(2) According to the attraction (bigger similarity coefficient) and repulsion (smaller similarity coefficient) principle, we choose pairs of files which will be placed on different servers. At first, we find two pairs of files with the largest similarity coefficients from Table 2 and place them on server1 and server2, respectively. We choose A4 and A9 to be placed on server1, because the similarity coefficient of A4 and A9 is 0.75, which is the biggest similarity coefficient. Then we choose the pair with the biggest similarity coefficient from the rest of the pairs of files. The similarity coefficients of A10-A7, A10-A8, A6-A7, and A1-A8 are all 0.44. As can be seen, among these four pairs of files, A7, A8, and A10 all appear twice, but A1 and A6 each appear only once. Therefore, we should select two files from A7, A8, and A10 to be stored on server2. The similarity coefficients between A7, A8, A10 and A9 on the other server (i.e., server1) are identical, so we might as well choose A8 and A10 to be placed on server2. This process is shown in Fig. 1.

A4A9 = {B3B5B7B8} (Now the size of data stored on server1 is 4 blocks.)
A8A10 = {B1B4B6B7} (Now the size of data stored on server2 is 4 blocks.)

Fig. 1 The first step of data partitioning.

(3) From Table 3, we find that file A2 has the biggest similarity coefficient with the blocks currently stored on server1 (the similarity coefficient is 0.5), so it will be placed on server1. From Table 4, we find that files A1 and A7 have the biggest similarity coefficient with the blocks currently stored on server2 (the similarity coefficients are both 0.33). The similarity coefficients between files A1, A7 and the blocks currently stored on server1 are both 0.08. According to the attraction and repulsion principle, here we can choose A1 (of course, we could also choose A7). This process is shown in Fig. 2.

A4A9A2 = {B3B5B7B8} (Now the size of data stored on server1 is 4 blocks.)
A8A10A1 = {B1B2B4B6B7} (Now the size of data stored on server2 is 5 blocks.)
(4) From Table 5, we find that file A3 has the biggest similarity coefficient with the blocks currently stored on server2 (the similarity coefficient is 0.45), so it will be placed on server2. At the same time, from Table 6, we find that file A6 has the biggest similarity coefficient with the blocks currently stored on server1 (the similarity coefficient is 0.33), so it will be placed on server1. This process is shown in Fig. 3.

A4A9A2A6 = {B3B5B6B7B8} (Now the size of data stored on server1 is 5 blocks.)
A8A10A1A3 = {B1B2B4B6B7B8} (Now the size of data stored on server2 is 6 blocks.)

Table 3 Similarity coefficients between A4A9 on server1 and files not placed on any server.
        A1    A2    A3    A5    A6    A7
A4A9    0.08  0.5   0.06  0.13  0.33  0.08

Table 4 Similarity coefficients between A8A10 on server2 and files not placed on any server.
        A1    A2    A3    A5    A6    A7
A8A10   0.33  0.13  0.25  0.13  0.08  0.33

Fig. 2 The second step of data partitioning.
Table 5 Similarity coefficients between A8A10A1 on server2 and files not placed on any server.
         A3    A5    A6    A7
A8A10A1  0.45  0.1   0.07  0.27

Table 6 Similarity coefficients between A4A9A2 on server1 and files not placed on any server (after file A2 is stored on server1, the blocks on server1 do not change, so the similarity coefficients do not change).
         A3    A5    A6    A7
A4A9A2   0.06  0.13  0.33  0.08

Fig. 3 The third step of data partitioning.

(5) From Table 7, we find that file A5 has the biggest similarity coefficient with the blocks currently stored on server1 (the similarity coefficient is 0.4), so it will be placed on server1. A4A9A2A6A5 = {B3B5B6B7B8} (At present, the size of the data stored on server1 is still 5 blocks.) As can be seen from Tables 7 and 8, the similarity coefficient between A7 and the blocks currently stored on server1 is now larger than the similarity coefficient between A7 and the blocks currently stored on server2, and the current number of blocks stored on server1 is less than the number of blocks stored on server2; so, in order to balance the distribution of blocks on the servers and follow the attraction and repulsion principle, A7 will also be placed on server1. This process is shown in Fig. 4.

A4A9A2A6A5A7 = {B3B4B5B6B7B8} (Now the size of data stored on server1 is 6 blocks.)
A8A10A1A3 = {B1B2B4B6B7B8} (Now the size of data stored on server2 is 6 blocks.)

Table 7 Similarity coefficients between A4A9A2A6 on server1 and files not placed on any server.
           A5    A7
A4A9A2A6   0.4   0.27

Table 8 Similarity coefficients between A8A10A1A3 on server2 and files not placed on any server.
           A5    A7
A8A10A1A3  0.33  0.22

Fig. 4 The fourth step of data partitioning.

Now, all of the files have been placed on the appropriate servers based on the similarity of the files.

2.3 Redundancy removal based on the occurrence frequency of blocks

From the process of data partitioning described above, the redundant blocks on server1 and server2 are B4, B6, B7, and B8. If the capacity of server1 and server2 is greater than the size of the blocks that need to be placed on them, then the redundant blocks do not need to be removed. Otherwise, we need to decide the degree of redundancy removal based on the capacity of server1 and server2. The most extreme case is that blocks on all servers cannot be redundant. In the above example, among the files placed on server1, the occurrence frequencies of B4, B6, B7, and B8 are 1, 3, 2, and 3, respectively; among the files placed on server2, the occurrence frequencies of B4, B6, B7, and B8 are 3, 1, 3, and 1, respectively. To minimize the number of accesses to the servers which store the blocks, we can remove B4 and B7 from server1, and remove B6 and B8 from server2. Eventually, the blocks placed on server1 are B3B5B6B8, and the blocks placed on server2 are B1B2B4B7. To obtain A2, A7, or A9 stored on server1, it is necessary to access server2 once each (three accesses in total). Similarly, to obtain A3 or A10 stored on server2, it is necessary to access server1 once each (two accesses in total). Therefore, redundancy removal based on the frequency of blocks is able to realize data storage with smaller storage space and higher access efficiency.
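The frequency rule of Section 2.3 can be sketched as follows. This is an illustrative Python fragment, not the paper's implementation; servers are assumed to be given as lists of files, each file a list of block identifiers, and ties in frequency are broken arbitrarily:

```python
from collections import Counter

def remove_cross_server_redundancy(servers):
    """servers: dict server name -> list of files, each file a list of block IDs.
    For each block stored on more than one server, keep it only on the server
    where it occurs in the most files (the Section 2.3 rule)."""
    # occurrence frequency of every block on every server (per-file, distinct blocks)
    freq = {name: Counter(b for f in files for b in set(f))
            for name, files in servers.items()}
    kept = {name: set(counts) for name, counts in freq.items()}
    for block in set().union(*kept.values()):
        holders = [n for n in servers if block in kept[n]]
        if len(holders) > 1:
            best = max(holders, key=lambda n: freq[n][block])
            for n in holders:
                if n != best:
                    kept[n].discard(block)
    return kept

servers = {
    "server1": [["B5","B3","B5","B8"], ["B5","B3","B7","B8"], ["B3","B7"],
                ["B3","B5","B6"], ["B6","B8"], ["B3","B4","B6"]],   # A4 A9 A2 A6 A5 A7
    "server2": [["B4","B1","B7"], ["B4","B7","B6"], ["B1","B2","B7"],
                ["B2","B4","B1","B8"]],                              # A8 A10 A1 A3
}
print(remove_cross_server_redundancy(servers))
# server1 keeps {B3, B5, B6, B8}; server2 keeps {B1, B2, B4, B7}
```

On the example above, this reproduces the result in the text: B4 and B7 are removed from server1, and B6 and B8 are removed from server2.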
3 The FSAP Algorithm

In Section 2, we explained the process of the FSAP algorithm through an example, including how to calculate the similarity coefficients between every two files, how to balance the distribution of blocks on servers based on the similarity of files, and how to remove redundancy based on the occurrence frequency of blocks. The FSAP algorithm can realize data storage with smaller storage space and higher access efficiency. The key steps of the FSAP algorithm (as shown in Algorithm 1) are introduced in detail below.

Algorithm 1 The FSAP Algorithm
Input:
  n: the total number of files which will be distributed on servers
  set of files A = {A1, A2, ..., An}
  C: server capacity
  S(A): the total size of A
  K: the maximum number of the required servers
  d: remaining server capacity
  N(A): the number of files in A
  num(Ai): the number of blocks in file Ai
  q: block size
Output: the result of data partitioning
 1: Initialize: the files in each server: D ← ∅; K = S(A)/C; d = C
 2: find the K pairs of files with the largest similarity coefficients in the set Sim(Ai, Aj), 1 ≤ i < j ≤ n
 3: for all 1 ≤ m ≤ K do
 4:   add Aim and Ajm into Dm: Dm ← Dm + Aim + Ajm
 5:   remove Aim and Ajm from A: A ← A − Aim − Ajm; N(A) = N(A) − K
 6: L1:
 7: construct a K × N(A) matrix M
 8: for all 1 ≤ x ≤ K do
 9:   for all 1 ≤ y ≤ N(A) do
10:     Mxy = Sim(Dx, Ay)
11: while M is not empty do
12:   find the largest element Mx1y1 in M
13:   if C − num(Dx1 ∪ Ay1) × q ≥ 0 then
14:     remove the x1-th row and the y1-th column from M
15:     Dx1 ← Dx1 + Ay1
16:     A ← A − Ay1
17:   else
18:     remove the x1-th row from M
19: if A is not empty then
20:   goto L1
21: according to the occurrence frequency, remove redundant blocks from Dm, 1 ≤ m ≤ K
22: remove the redundant blocks among different servers based on STUMM
23: D = {D1, D2, ..., DK}
24: return D

Line 1: The result of data partitioning is initialized, that is, D ← ∅. At the same time, the remaining server capacity is initialized by d = C, and the maximum number of the required servers is calculated by K = S(A)/C.

Line 2: The similarity coefficients between every two files (Sim(Ai, Aj), 1 ≤ i < j ≤ n) are calculated according to the definition proposed in this paper; then the K pairs of files whose similarity coefficients are among the top K are found. The K pairs of files are denoted by <Aim, Ajm>, where 1 ≤ m ≤ K.

Lines 3-5: The K pairs of files are placed on K servers separately; that is, each server stores one pair of files, denoted by Dm. Files Aim and Ajm are added into Dm, and then they are deleted from A. After the K pairs of files are processed, the number of files in A is reduced correspondingly.

Line 6: L1 is a label of the goto statement.

Lines 7-10: A K × N(A) matrix M is constructed, with Mxy = Sim(Dx, Ay) (1 ≤ x ≤ K, 1 ≤ y ≤ N(A)). Each row of M corresponds to the similarity coefficients between the files placed on one server and the files not placed on any server.

Lines 11-18: This loop executes while M is not empty. Only the largest element Mx1y1 in M is found in each iteration; in other words, file Ay1 is the most suitable file to be placed on the x1-th server. Before these operations are executed, the algorithm needs to check whether there is enough remaining space. On the other hand, in order to balance the amount of files on each server, only one file is placed on each server in each iteration. If C − num(Dx1 ∪ Ay1) × q ≥ 0, then the x1-th row and the y1-th column are removed from M; at the same time, file Ay1 is added into Dx1 and then deleted from A.
If C − num(Dx1 ∪ Ay1) × q < 0, then the x1-th row is removed from M.

Lines 19-20: If the set A is not empty, lines 7-18 are executed repeatedly.
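Lines 6-20 of Algorithm 1 can be condensed into the following illustrative Python sketch (not the paper's implementation). Here sim is the similarity function from Section 2.1 applied to block sets, files are represented as sets of block identifiers, and the placed_any guard is an added safeguard for the case where no remaining file fits on any server:

```python
def greedy_rounds(D, A, C, q, sim):
    """Sketch of lines 6-20 of Algorithm 1.
    D: list of K block sets, already seeded with the K file pairs (lines 3-5).
    A: dict mapping file name -> set of blocks, for files not yet placed.
    C: server capacity, q: block size, sim: similarity coefficient on block sets."""
    while A:                                       # lines 19-20: repeat while A is not empty
        # lines 7-10: build the K x N(A) similarity matrix M
        M = {(x, y): sim(D[x], A[y]) for x in range(len(D)) for y in A}
        placed_any = False                         # safeguard added in this sketch
        while M:                                   # lines 11-18
            x1, y1 = max(M, key=M.get)             # line 12: largest element of M
            if C - len(D[x1] | A[y1]) * q >= 0:    # line 13: the union still fits
                D[x1] = D[x1] | A[y1]              # line 15: place Ay1 on server x1
                del A[y1]                          # line 16
                placed_any = True
                # line 14: drop the x1-th row and the y1-th column
                M = {k: v for k, v in M.items() if k[0] != x1 and k[1] != y1}
            else:
                M = {k: v for k, v in M.items() if k[0] != x1}  # line 18
        if not placed_any:                         # nothing left fits anywhere; stop
            break
    return D, A
```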
Line 21: If the set A has become empty, redundant blocks are removed from Dm (where 1 ≤ m ≤ K) based on the occurrence frequency of blocks. The redundant blocks with lower occurrence frequencies are removed from the servers which store them.

Line 22: According to the capacity of each server and the access time tolerated by users, the degree of redundancy removal is determined. In Section 4, we introduce the STUMM model in detail.

Lines 23-24: The result of data partitioning is output.

The complexity of the FSAP algorithm proposed in this paper is lower than that of the SAP algorithm: in the first step, calculating the similarity coefficients between all pairs of non-empty files involves a large amount of computation, and the memory usage is also large. However, since the adjacency matrix in the FSAP algorithm is a symmetric matrix, the amount of calculation and the memory occupancy are reduced by half.

The amount of files on each server is nearly the same after data partitioning: in order to balance the amount of files stored on each server, only one file, the one with the highest similarity, is placed on each server in each loop iteration. Therefore, the amount of files on each server is nearly the same. This data partitioning is more reasonable than that of the SAP algorithm.

4 Space-Time Utility Maximization Model

Definition 3 Space-time utility P is a weighted average of data space usage and data access time. Namely,
P = α/S + (1 − α)/T, 0 ≤ α ≤ 1,
where S is the data space usage, which means the total amount of data space used on the servers, and T is the data access time, which indicates the average data access time to obtain all data files stored on the servers.

The SAP algorithm is just a special case (without consideration of removing the redundant data blocks among different storage servers) of the FSAP algorithm proposed in this paper. In addition, the SAP algorithm only considers the data space usage S, whereas the FSAP algorithm considers not only the data space usage S but also the data access time T. When α = 1, the space-time utility P is completely controlled by the data space usage S. On the other hand, when α = 0, the space-time utility P is completely controlled by the data access time T. This paper takes into account the effect of both data space usage and data access time on space-time utility; therefore, the range of α should be 0 ≤ α ≤ 1. If 0 ≤ α < 0.5, then the data access time influences the space-time utility more than the data space usage. If 0.5 < α ≤ 1, then the data space usage influences the space-time utility more than the data access time. If α = 0.5, then the data space usage and the data access time influence the space-time utility equally. This paper discusses under which circumstances the value of the space-time utility P is maximized.

Generally, if all data can be obtained locally, the data access time T is a constant T0 which does not change with the data space usage S. Otherwise, when the data space usage S becomes greater, the data access time T decreases due to the lower probability of data access across servers; when the data space usage S becomes smaller, the data access time T increases because of the higher probability of data access across servers. The data access time T is then a function of the data space usage S; here, suppose that the function is T = β/S, where β is a constant greater than 0.

(1) If all data can be obtained locally, then Pmax = max{α/S + (1 − α)/T0}. In this case, as the data space usage S increases, the data access time T is still the constant T0 and does not decrease. Therefore, if all data can be obtained locally, Pmax = α/min S + (1 − α)/T0; that is, the space-time utility P reaches its maximum value when the redundant blocks within each server are completely eliminated.
As discussed above, this is guaranteed after data partitioning with the FSAP algorithm proposed in this paper. Therefore, P_max = α/S + (1 − α)/T0.
(2) If some data may be obtained remotely, then P_max = max{α/S + (1 − α)/(β/S)} = max{α/S + (1 − α)S/β}, and ∂P/∂S = −α/S² + (1 − α)/β. Therefore, when ∂P/∂S = 0, namely −α/S² + (1 − α)/β = 0, that is, when S = √(αβ/(1 − α)), the space-time utility P reaches the extreme value α/√(αβ/(1 − α)) + (1 − α)√(αβ/(1 − α))/β. S can be determined according to different values of α and β.
In order to obtain the maximum value of P, both S and T need to be converted into a unified dimension. The greater S is, the more storage cost is paid by the data center; the larger T is, the higher the latency that must be tolerated. When the total size of the files assigned to a server is much less than the server capacity S_file, redundancy may be allowed, i.e., different servers can hold the same blocks. When the size of the files on a server is close to or exceeds the capacity S_file, the redundant blocks among servers must be removed according to the occurrence frequency of blocks. While redundancy removal improves the utilization of space, it reduces the reliability of data.
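To make the second case concrete, the short Python sketch below evaluates P = α/S + (1 − α)S/β at the stationary point S* = √(αβ/(1 − α)) and checks numerically that the derivative vanishes there. The values of α and β are illustrative placeholders, not figures taken from the experiments in this paper.

```python
import math

def space_time_utility(S, alpha, beta):
    # P = alpha/S + (1 - alpha)/T under the assumption T = beta/S,
    # i.e. P = alpha/S + (1 - alpha) * S / beta.
    return alpha / S + (1 - alpha) * S / beta

# Illustrative parameter values only (not taken from the paper's experiments).
alpha, beta = 0.6, 2.0e9

# Stationary point where dP/dS = -alpha/S^2 + (1 - alpha)/beta = 0.
S_star = math.sqrt(alpha * beta / (1 - alpha))

# Numerical check that the derivative really vanishes at S_star.
h = 1e-3 * S_star
dP_dS = (space_time_utility(S_star + h, alpha, beta)
         - space_time_utility(S_star - h, alpha, beta)) / (2 * h)

print(f"S* = {S_star:.4e}")
print(f"P(S*) = {space_time_utility(S_star, alpha, beta):.4e}")
print(f"dP/dS at S* = {dP_dS:.2e}")  # approximately zero
```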
Researchers have carried out experiments to analyze how much redundancy removal reduces the reliability of data[11]. The results of these experiments show that the availability of files is far lower than the availability of data blocks. For example, when the equipment failure rate reaches 6%, 99.5% of data blocks are available, but only 96% of files are accessible. The reason is that multiple files share common blocks, so there are dependencies between different files. The influence of redundancy removal on the reliability of data has long attracted the attention of scholars. The simplest way to improve the reliability of data is to create copies using mirroring, as in Ceph[12] and GPFS[13], but this greatly reduces the utilization rate of storage space. These systems rarely take both the utilization rate of storage space and the reliability of data into account.
For cloud storage systems, in order to increase the availability of data, we should also keep some redundant data. How much redundant data should be kept is a difficult problem. Using the Space-Time utility maximization model, according to the capacity of each server and the access time tolerated by users, we can determine the degree of redundancy removal and obtain the maximum value of P in a more reasonable way. In Section 5, we verify this by experiments.
5 Performance Evaluation and Analysis
In this section, we first compare the compression ratio and the distribution of data blocks between SAP and FSAP by experiment; we then verify the reasonability of the Space-Time utility maximization model.
5.1 Comparison of the utilization rate of storage space of the FSAP algorithm with that of the SAP algorithm
In this part, the number of servers is max{2, the number of files/20} (each server is configured with a dual-core CPU and 4 GB of memory), and we compare the utilization rate of storage space of the FSAP1 algorithm, the FSAP2 algorithm, the SAP-Space-Delta algorithm, and the SAP-Space-Dedup algorithm. The FSAP1 algorithm is normal compression, and the FSAP2 algorithm is deep compression with a threshold value of 2; namely, if the occurrence frequency of a redundant data block on one server is smaller than 2, the redundant copies of that block on this server are removed. For example, suppose the occurrence frequency of data block B1 on server1 is 1 and the occurrence frequency of B1 on server2 is 2. Under the FSAP1 algorithm, the redundant copies of B1 on both server1 and server2 are kept; under the FSAP2 algorithm, B1 on server1 is removed, while B1 on server2 is retained. A minimal sketch of this threshold rule is given below.
We perform basic experiments on a set of files: 100 random web files downloaded from CNN. In the course of these experiments, we keep the block size constant at 48, change the number of files in increments of 5, and measure the compression ratio achieved by the FSAP1 algorithm, the FSAP2 algorithm, the SAP-Space-Delta algorithm, and the SAP-Space-Dedup algorithm relative to the total size of the uncompressed files. The test results are shown in Table 9 and Fig. 5. In addition, 100 files are stored on 5 servers, and Table 10 shows the distribution of data blocks using the FSAP1 algorithm, the FSAP2 algorithm, the SAP-Space-Delta algorithm, and the SAP-Space-Dedup algorithm.
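The following minimal Python sketch illustrates the FSAP2-style threshold rule described above. The data layout (per-server occurrence counts keyed by block identifier), the function name, and the choice of always keeping the copy on the server with the most occurrences are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from collections import defaultdict

def deep_redundancy_removal(block_freq, threshold):
    """Remove cross-server redundant copies of blocks.

    block_freq: dict mapping block_id -> {server_id: occurrence frequency}.
    Rule (paraphrasing the deep-compression step above):
      - the server holding the most occurrences always keeps its copy,
        so at least one copy of every block survives;
      - other servers keep a copy only if its frequency exceeds the threshold;
      - copies whose frequency is at or below the threshold are removed.
    """
    kept = defaultdict(dict)
    for block, per_server in block_freq.items():
        anchor = max(per_server, key=per_server.get)  # server with most copies
        for server, freq in per_server.items():
            if server == anchor or freq > threshold:
                kept[block][server] = freq
    return dict(kept)

# Toy data mirroring the B1 example above (hypothetical values).
freqs = {"B1": {"server1": 1, "server2": 2}}
print(deep_redundancy_removal(freqs, threshold=2))
# -> {'B1': {'server2': 2}}: B1 is dropped from server1 and kept on server2,
#    as in the FSAP2 behaviour described above (FSAP1 skips this step).
```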
Table 9 The test results of the FSAP algorithm and the SAP algorithm (compression ratio, %).
Number of files | SAP-Space-Delta | SAP-Space-Dedup | FSAP1 | FSAP2
5 | 11.14 | 11.74 | 12.60 | 19.97
10 | 11.42 | 14.79 | 14.12 | 20.60
15 | 11.80 | 18.04 | 18.15 | 24.05
20 | 12.49 | 19.77 | 19.59 | 24.95
25 | 12.00 | 20.77 | 20.30 | 25.52
30 | 11.26 | 21.04 | 20.71 | 25.73
35 | 12.81 | 23.36 | 22.56 | 27.32
40 | 13.71 | 24.96 | 25.25 | 29.67
45 | 11.60 | 24.51 | 25.46 | 29.74
50 | 11.80 | 25.53 | 26.38 | 30.62
55 | 12.72 | 26.61 | 26.73 | 30.89
60 | 12.52 | 23.45 | 23.37 | 28.71
65 | 12.38 | 24.18 | 23.94 | 29.27
70 | 12.16 | 24.51 | 24.38 | 29.56
75 | 11.93 | 24.76 | 24.72 | 29.72
80 | 11.83 | 22.71 | 23.32 | 29.18
85 | 11.63 | 23.02 | 23.50 | 29.30
90 | 11.52 | 23.38 | 23.84 | 29.64
95 | 11.86 | 24.20 | 24.26 | 29.98
100 | 11.65 | 22.44 | 23.00 | 29.35

Fig. 5 The compression ratio of files from CNN using different algorithms.

Table 10 The distribution of data blocks using different algorithms.
Algorithm | server1 | server2 | server3 | server4 | server5
SAP-Space-Delta | 7536 | 9988 | 11 848 | 14 362 | 20 721
SAP-Space-Dedup | 6843 | 9055 | 10 521 | 12 560 | 17 606
FSAP1 | 10 582 | 11 267 | 12 635 | 11 122 | 10 567
FSAP2 | 9196 | 10 012 | 11 138 | 9565 | 9180

For the files from CNN, the FSAP1 algorithm achieves a compression ratio of 23%, the FSAP2 algorithm achieves a compression ratio of 29.35%, the SAP-Space-Delta algorithm achieves a compression ratio of 11.65%, and the SAP-Space-Dedup algorithm achieves a compression ratio of 22.44%. The compression ratio of the CNN files using the FSAP1 algorithm is obviously higher than that using the SAP-Space-Delta algorithm, and it is slightly higher than that using the SAP-Space-Dedup algorithm. Because the test web files downloaded from CNN are all approximately the same size, the curves of the FSAP1 algorithm and the SAP-Space-Dedup algorithm are very similar. Even in this case, the compression ratio of the FSAP1 algorithm is still higher than that of the SAP-Space-Dedup algorithm.
From the test results, we can find that (1) for a fixed number of servers, as the number of files increases, the compression ratio gradually increases; and (2) if the number of servers increases, the compression ratio decreases. As an analogy to the actual situation, if the capacity of each server is sufficient, then more files will be stored on it and a higher compression ratio will be achieved. Conversely, if the capacity of a server is insufficient, then we need to increase the number of servers, and the compression ratio will decline.
From Table 10, we can find that the distribution of data blocks using the SAP-Space-Delta algorithm and the SAP-Space-Dedup algorithm is seriously imbalanced: the numbers of data blocks on different servers differ obviously. In contrast, the distribution of data blocks using the FSAP1 algorithm and the FSAP2 algorithm is basically uniform across servers. The test results show that the FSAP algorithm achieves a more even distribution of data blocks than the SAP algorithm. In general, the test results show that the FSAP algorithm proposed in this paper is better than the SAP algorithm. Further, the compression ratio using the FSAP2 algorithm is far greater than that using the FSAP1 algorithm, which also shows that the FSAP algorithm has a significant effect on the removal of redundant data blocks among different servers.
5.2 The test of the Space-Time utility maximization model
In order to test the Space-Time utility maximization model, we perform experiments on a set of 100 random web files downloaded from CNN using the FSAP algorithm. The number of servers is fixed at 5, and each server is configured with a dual-core CPU and 4 GB of memory. We define a variable, the threshold: for the redundant data blocks, if the occurrence frequency on one server is smaller than the threshold value, the redundant copies on this server are removed (under the premise of retaining at least one copy); if the frequency on each server is larger than the threshold, the redundant copies are not removed; if the frequency is equal to the threshold, we retain the redundant copies on a single server only. As the maximum occurrence frequency is 7 or 8[10], we set the maximum threshold value to 10. We observe how S and T change over the threshold value of redundancy removal; the changes of S and T directly affect P. The test procedure is as follows:
(1) To obtain the change curves of S over the threshold value of redundancy removal.
S is the total number of data blocks retained on the 5 storage servers.
(2) To obtain the change curves of T over the threshold value of redundancy removal. T is obtained by numerical simulation, and its calculation standard is as follows: if one file is completely stored on one server, its access time is the number of data blocks in this file; after deep redundancy removal of data blocks, each time one data block is removed, T increases by 1 because of the need to obtain that data block from another server.
(3) The dimensions of S and T are different, so we need to unify them. This is an important step in the test of the Space-Time utility maximization model. The method
adopted in this paper is as follows:
S′ = (1/S − min(1/S)) / (max(1/S) − min(1/S)),
T′ = (1/T − min(1/T)) / (max(1/T) − min(1/T)).
(4) To obtain the change curves of P over the threshold value of redundancy removal. Now,
P = α·S′ + (1 − α)·T′, 0 ≤ α ≤ 1.
The test results are shown in Table 11 and Figs. 6 and 7. From the test results, we can find that as the threshold value of redundancy removal increases, data space usage S decreases; meanwhile, data access time T increases. When the threshold value of redundancy removal is 2, 3, and 4, data space usage S decreases significantly and data access time T increases obviously. Eventually, data space usage and data access time tend toward constant values, which means that at this point there are no redundant data blocks among servers.

Table 11 The test results of S and T (N is the number of servers).
Threshold | N=2: S, T | N=3: S, T | N=4: S, T | N=5: S, T
1 | 49 996, 49 996 | 52 910, 52 910 | 54 832, 54 832 | 56 059, 56 059
2 | 45 783, 54 209 | 46 177, 59 643 | 46 353, 63 311 | 46 419, 65 699
3 | 45 023, 55 729 | 45 118, 61 761 | 45 108, 65 801 | 45 101, 68 335
4 | 44 710, 56 668 | 44 751, 62 862 | 44 751, 66 872 | 44 727, 69 457
5 | 44 577, 57 200 | 44 599, 63 470 | 44 573, 67 584 | 44 548, 70 173
6 | 44 504, 57 565 | 44 517, 63 880 | 44 499, 67 954 | 44 481, 70 508
7 | 44 461, 57 823 | 44 469, 64 168 | 44 448, 68 260 | 44 434, 70 790
8 | 44 429, 58 047 | 44 438, 64 385 | 44 420, 68 456 | 44 406, 70 986
9 | 44 413, 58 175 | 44 414, 64 577 | 44 403, 68 592 | 44 384, 71 162
10 | 44 399, 58 301 | 44 401, 64 694 | 44 383, 68 772 | 44 366, 71 324

Fig. 6 The relationship between S and the threshold value of redundancy removal. N is the number of servers.
Fig. 7 The relationship between T and the threshold value of redundancy removal. N is the number of servers.

At the same time, Table 12 and Fig. 8 show that P always attains an extremum within the range of the threshold. When α = 0, P reaches its maximum value at the minimum threshold; when α = 1, P reaches its maximum value at the maximum threshold. As α changes, P reaches its maximum value at different thresholds. The curves of P show that the Space-Time utility maximization model is reasonable. Additionally, using the Space-Time utility maximization model proposed in this paper, by observing the change of P with the threshold value of redundancy removal, we can achieve the best balance between data space usage and data access time.

Table 12 The value of P for different α.
Threshold | α=0 | α=0.2 | α=0.5 | α=0.6 | α=0.8 | α=1
1 | 1 | 0.8 | 0.5 | 0.4 | 0.2 | 0
2 | 0.314 42 | 0.409 13 | 0.551 19 | 0.598 55 | 0.693 25 | 0.787 96
3 | 0.160 63 | 0.312 88 | 0.541 25 | 0.617 37 | 0.769 62 | 0.921 87
4 | 0.098 71 | 0.271 23 | 0.530 01 | 0.616 27 | 0.788 79 | 0.961 31
5 | 0.060 24 | 0.244 27 | 0.520 32 | 0.612 34 | 0.796 38 | 0.980 41
6 | 0.042 50 | 0.231 52 | 0.515 05 | 0.609 56 | 0.798 58 | 0.987 61
7 | 0.027 70 | 0.220 70 | 0.510 18 | 0.606 68 | 0.799 67 | 0.992 66
8 | 0.017 49 | 0.213 13 | 0.506 58 | 0.604 40 | 0.800 04 | 0.995 68
9 | 0.008 36 | 0.206 30 | 0.503 20 | 0.602 18 | 0.800 12 | 0.998 06
10 | 0 | 0.2 | 0.5 | 0.6 | 0.8 | 1

Fig. 8 The relationship between P and the threshold value of redundancy removal.
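As a compact illustration of steps (3)-(4), the Python sketch below normalizes the N = 5 column of Table 11 and computes P for a chosen α. The helper name and the use of NumPy are our own choices made for illustration; the S and T arrays are simply the Table 11 values re-typed.

```python
import numpy as np

def space_time_utility_curve(S_values, T_values, alpha):
    """Normalize S and T (steps (3)-(4) above) and return P for each threshold."""
    S = np.asarray(S_values, dtype=float)
    T = np.asarray(T_values, dtype=float)

    def normalize(x):
        # Min-max normalization of the reciprocals, as in the formulas above.
        inv = 1.0 / x
        return (inv - inv.min()) / (inv.max() - inv.min())

    return alpha * normalize(S) + (1 - alpha) * normalize(T)

# S and T for N = 5 servers, thresholds 1..10 (values from Table 11).
S_n5 = [56059, 46419, 45101, 44727, 44548, 44481, 44434, 44406, 44384, 44366]
T_n5 = [56059, 65699, 68335, 69457, 70173, 70508, 70790, 70986, 71162, 71324]

P = space_time_utility_curve(S_n5, T_n5, alpha=0.5)
best_threshold = int(np.argmax(P)) + 1   # thresholds are numbered from 1
print(np.round(P, 5))                    # compare with the alpha = 0.5 column of Table 12
print("best threshold for alpha = 0.5:", best_threshold)  # -> 2
```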
6 Related Work
Data de-duplication, as a compression method, has been widely used in most backup systems to improve bandwidth and space efficiency[14]. This technology removes redundant data from the data stream: the finer the grain size, the more redundant data is removed, but the more computing resources are consumed at the same time. De-duplication technology has already received a lot of attention in industry. Single Instance Storage (SIS) on Microsoft Windows Server 2000 ensures that each file is stored only once and provides copies of the file through links in the file system[15]. This method is simple, easy to use, and well suited to collections of small files, but it cannot detect the same data block within different files. In order to delete duplicate data at the data block level, the fixed-size block method was designed first[16]. In this method, each file is divided into non-overlapping blocks of fixed size, and each data block has a hash value. Bell Laboratories' Venti[8] file storage system saves approximately 30% of storage space using this method, and IBM applied this technology to the online TotalStorage SAN File System[16]. In addition, virtual machine migration also uses the fixed-length block division method to improve migration performance and reduce network cost[17]. This algorithm is very simple and efficient, but it is very sensitive to changes in data position: inserting a new byte at the file header can cause all subsequent data blocks to become new blocks. The Low-Bandwidth File System (LBFS)[18] uses the Rabin fingerprint[19, 20] instead of a fixed size to divide a file into variable-size blocks. This method, called content-based block division, makes the storage system no longer sensitive to position modifications. The value-based network buffer system, the DDFS of Data Domain, the Pasta P2P file system, and the Pastiche backup system also used this technology to identify and eliminate redundant data. However, this algorithm has uncertainty in the definition of data block boundaries and has a higher overhead. Besides these algorithms, some techniques, such as delta encoding[21], Bloom filters[22], and sparse indexing[23], were also used to enhance the performance of de-duplication technology.
7 Conclusions
In this paper, we discuss the problem of redundant data in cloud storage systems and study the main de-duplication technologies. The main goal of this paper is to use less space to store more data files while not obviously affecting access efficiency. Based on the study of other data de-duplication technologies (especially the SAP algorithm), we first propose a more reasonable data partitioning algorithm, FSAP, and then present the Space-Time utility maximization model; finally, we compare the utilization rate of storage space of the FSAP algorithm with that of the SAP algorithm, and test the Space-Time utility maximization model through experiments on a set of 100 random web files downloaded from CNN. The FSAP algorithm overcomes the shortcoming of the SAP algorithm that its calculation of the similarity between two files is not reasonable, and it also reduces the computational complexity and the space occupied. The experimental results on the files from CNN show that, relative to the SAP-Space-Delta algorithm and the SAP-Space-Dedup algorithm, the compression ratio using the FSAP algorithm increased by 17.7% and 6.91%, respectively.
At the same time, the test results show that the FSAP algorithm achieves a more even distribution of data blocks than the SAP algorithm. These facts show that the FSAP algorithm proposed in this paper is much more efficient than the SAP algorithm. Data de-duplication does not mean that there should be no data redundancy at all; proper data redundancy can reduce access time and increase the reliability of data access. Using the Space-Time utility maximization model proposed in this paper, by observing the change of P with the threshold value of redundancy removal, we can achieve the best balance between data space usage and data access time.
Acknowledgements
This work was supported by the National High-Tech Research and Development (863) Program of China (No. 2015AA01A303).
References
[1] T. Benson, A. Akella, and D. A. Maltz, Network traffic characteristics of data centers in the wild, in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, ACM, 2010, pp. 267–280.
[2] D. T. Meyer and W. J. Bolosky, A study of practical deduplication, ACM Transactions on Storage (TOS), vol. 7, no. 4, p. 14, 2012.
[3] A. T. Clements, I. Ahmad, M. Vilayannur, and J. Li, Decentralized deduplication in SAN cluster file systems, in USENIX Annual Technical Conference, 2009, pp. 101–114.
[4] U. Manber, Finding similar files in a large file system, in USENIX Winter 1994 Technical Conference, 1994, pp. 1–10.
[5] B. S. Baker, On finding duplication and near-duplication in large software systems, in Reverse Engineering, 1995, Proceedings of 2nd Working Conference on, IEEE, 1995, pp. 86–95.
[6] G. Forman, K. Eshghi, and S. Chiocchetti, Finding similar files in large document repositories, in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, ACM, 2005, pp. 394–400.
[7] J. R. Douceur, A. Adya, W. J. Bolosky, P. Simon, and M. Theimer, Reclaiming space from duplicate files in a serverless distributed file system, in Distributed Computing Systems, 2002, Proceedings of the 22nd International Conference on, IEEE, 2002, pp. 617–624.
[8] S. Quinlan and S. Dorward, Venti: A new approach to archival storage, FAST, vol. 2, pp. 89–101, 2002.
[9] B. Zhu, K. Li, and R. H. Patterson, Avoiding the disk bottleneck in the data domain deduplication file system, FAST, vol. 8, pp. 1–14, 2008.
[10] B. Balasubramanian, T. Lan, and M. Chiang, SAP: Similarity-aware partitioning for efficient cloud storage, in INFOCOM, 2014 Proceedings IEEE, 2014, pp. 592–600.
[11] D. Bhagwat, K. Pollack, D. D. Long, T. Schwarz, E. L. Miller, and J. F. Paris, Providing high reliability in a minimum redundancy archival storage system, in Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2006 (MASCOTS 2006), 14th IEEE International Symposium on, IEEE, 2006, pp. 413–421.
[12] T. J. Schwarz, Q. Xin, E. L. Miller, D. D. Long, A. Hospodor, and S. Ng, Disk scrubbing in large archival storage systems, in Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, 2004 (MASCOTS 2004), Proceedings of the IEEE Computer Society's 12th Annual International Symposium on, IEEE, 2004, pp. 409–418.
[13] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn, Ceph: A scalable, high-performance distributed file system, in Proceedings of the 7th Symposium on Operating Systems Design and Implementation, USENIX Association, 2006, pp. 307–320.
[14] R. Zhu, L.-H. Qin, J.-L. Zhou, and H. Zheng, Using multithreads to hide deduplication I/O latency with low synchronization overhead, Journal of Central South University, vol. 20, pp. 1582–1591, 2013.
[15] W. J. Bolosky, S. Corbin, D. Goebel, and J. R. Douceur, Single instance storage in Windows 2000, in Proceedings of the 4th USENIX Windows Systems Symposium, Seattle, USA, 2000, pp. 13–24.
[16] B. Hong, D. Plantenberg, D. D. Long, and M. Sivan-Zimet, Duplicate data elimination in a SAN file system, in Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, 2004, pp. 301–314.
[17] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum, Optimizing the migration of virtual computers, ACM SIGOPS Operating Systems Review, vol. 36, no. SI, pp. 377–390, 2002.
[18] A. Muthitacharoen, B. Chen, and D. Mazieres, A low-bandwidth network file system, ACM SIGOPS Operating Systems Review, vol. 35, no. 5, pp. 174–187, 2001.
[19] M. O. Rabin, Fingerprinting by random polynomials, Center for Research in Computing Technology, Harvard University, Report TR-15-81, 1981.
[20] A. Z. Broder, Some applications of Rabin's fingerprinting method, in Sequences II, Springer, 1993, pp. 143–152.
[21] F. Douglis and A. Iyengar, Application-specific delta-encoding via resemblance detection, in USENIX Annual Technical Conference, General Track, 2003, pp. 113–126.
[22] B. H. Bloom, Space/time trade-offs in hash coding with allowable errors, Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
[23] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezis, and P. Camble, Sparse indexing: Large scale, inline deduplication using sampling and locality, in 7th USENIX Conference on File and Storage Technologies, 2009, pp. 111–123.
Jie Wu is the chair and a Laura H. Carnell Professor in the Department of Computer and Information Sciences at Temple University. Prior to joining Temple University, USA, he was a program director at the National Science Foundation and a distinguished professor at Florida Atlantic University. He received his PhD degree from Florida Atlantic University in 1989. His current research interests include mobile computing and wireless networks, routing protocols, cloud and green computing, network trust and security, and social network applications. Dr. Wu publishes regularly in scholarly journals, conference proceedings, and books. He serves on several editorial boards, including IEEE Transactions on Computers, IEEE Transactions on Service Computing, and Journal of Parallel and Distributed Computing. Dr. Wu was general co-chair/chair for IEEE MASS 2006 and IEEE IPDPS 2008 and program co-chair for IEEE INFOCOM 2011. Currently, he is serving as general chair for IEEE ICDCS 2013 and ACM MobiHoc 2014, and program chair for CCF CNCC 2013. He was an IEEE Computer Society Distinguished Visitor, ACM Distinguished Speaker, and chair of the IEEE Technical Committee on Distributed Processing (TCDP). Dr. Wu is a CCF Distinguished Speaker and a Fellow of the IEEE. He is the recipient of the 2011 China Computer Federation (CCF) Overseas Outstanding Achievement Award.
Jianjiang Li is currently an associate professor at the University of Science and Technology Beijing, China. He received his PhD degree in computer science from Tsinghua University in 2005. He was a visiting scholar at Temple University from Jan. 2014 to Jan. 2015. His current research interests include parallel computing, cloud computing, and parallel compilation.
Zhanning Ma is currently a master's student at the University of Science and Technology Beijing. His research interests include distributed storage technology and duplicate data removal. He received his bachelor's degree in 2010 from Taishan Medical University.