TSINGHUA SCIENCE AND TECHNOLOGY
ISSN 1007-0214 03/11 pp17-27
Volume 20, Number 1, February 2015
I-sieve: An Inline High Performance Deduplication System Used in
Cloud Storage
Jibin Wang, Zhigang Zhao, Zhaogang Xu, Hu Zhang, Liang Li, and Ying Guo
Abstract: Data deduplication is an emerging and widely employed method for current storage systems. As this
technology is gradually applied in inline scenarios such as virtual machines and cloud storage systems, this
study proposes a novel deduplication architecture called I-sieve. The goal of I-sieve is to realize a high performance
data sieve system based on iSCSI in the cloud storage system. We also design the corresponding index and
mapping tables and present a multi-level cache using a solid state drive to reduce RAM consumption and to optimize
lookup performance. A prototype of I-sieve is implemented based on the open source iSCSI target, and many
experiments have been conducted driven by virtual machine images and testing tools. The evaluation results
show excellent deduplication and foreground performance. More importantly, I-sieve can co-exist with the existing
deduplication systems as long as they support the iSCSI protocol.
Key words: I-sieve; cloud storage; data deduplication
1 Introduction
With the rapid development of information technology,
large amounts of data are stored in different storage
systems. According to EMC reports[1], the amount of data will reach 44 zettabytes by 2020. At the same time,
the total shipped disk storage system capacity only reached 7.1 exabytes in 2011, based on IDC reports[2].
There is a huge gap between data production and
available storage capacity for the foreseeable future,
which means that much of the information will get
discarded. On the other hand, even though disk cost
per GB has seen significant declines in recent years,
the total storage costs have revealed a general trend of
rapid increase[3]. One of the promising technologies for
Jibin Wang, Zhigang Zhao, Zhaogang Xu, Hu Zhang, Liang
Li, and Ying Guo are with the Shandong Computer Science
Center (National Supercomputer Center in Jinan), Shandong
Provincial Key Laboratory of Computer Networks, Jinan
250101, China. E-mail: {wangjb, zhaozhg, xuzhaogang, zhanghu, liliang, guoying}@sdas.org.
To whom correspondence should be addressed.
Manuscript received: 2014-11-15; revised: 2014-12-15;
accepted: 2015-01-07
improving storage efficiency is deduplication, which is
considered to be one of the best optimizations to reduce
storage costs. This technology can effectively eliminate
redundant data blocks by using pointers to reference
an already stored identical data block in the storage
system. Deduplication can work with many data types,
including block[4–7]
and sub-file[8–11]
, also known as
super-block or fixed-size container file, whole-file[12]
,
and hybrid-file[13, 14]
.
Historically, data deduplication technology was applied only to backup systems[15–17]. This is because
backup applications consist primarily of duplicate
data, especially for cyclical full backup applications.
Another reason is that most backup systems do
not need a high foreground performance. The main
indicators to evaluate backup system performance
are data throughput and deduplication ratio. Similar to optimizations in cloud applications[18, 19] or cloud
services[20, 21], there are many promising studies on performance optimizations resulting from
deduplication. Prior work on deduplication has primarily focused on optimizing current
backup systems, including chunking algorithms[5, 22–24] and performance optimizations[7, 16]. Many of the
solutions have been applied in the current commercial
systems as the secondary storage.
With the widespread use of primary storage systems
for storing large amounts of data, such as online video,
user directories, and VM images, there has been more research[6, 7, 22, 25] focusing on inline deduplication
for primary workloads rather than on backup environments only, especially with cloud storage[26–28]. The challenge
with inline deduplication is to determine mechanisms
to decrease the high latency introduced due to
deduplication operations, including hash table indexing
and metadata management. Ideally, an in-memory hash
table can decrease latency effectively; however, the
reality is that only a small subset of related tables can
be resident in the limited memory due to the large scale
of the current deduplication systems; as a result, most
of them are migrated to disk.
Another popular research area is to determine how
to speed-up the table lookup process and reduce the
number of chunk hashes in RAM. There are three types
of optimizations to reduce the size of the hash table
in RAM-reducing hash length, using a big chunk size,
and using a sampling hash. It is simple to compute a
hash using a shorter hash function (e.g., 32-bit MD5 or
40-bit SHA-1), however, there are challenges that arise
as a result. Specifically, as the number of the chunks
increase, more hash collisions can occur for different
chunk blocks, which can lead to data inconsistency in
the storage system. A large average chunk size in the
chunking algorithm can be used to as an efficient way to
cut down the total number of chunks in RAM. However,
this strategy can reduce the deduplication ratio since
the average chunk size in the current file system is still
small[29]
. Separate from the above two methods, the
sampling hash is widely used in many deduplication
systems, especially for large scale storage systems.
It is also relatively suitable for combining with hash
and cache structures to improve index performance;
therefore, we also utilize the sampling hash in our I-
sieve design.
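The sampling idea can be illustrated with a short sketch. The paper does not specify I-sieve's sampling rule, so the rule below (keep a fingerprint in RAM when its top bits are zero) and all names are illustrative assumptions:

```python
import hashlib

SAMPLE_BITS = 4  # keep roughly 1/16 of fingerprints in RAM (hypothetical rate)

def fingerprint(block: bytes) -> bytes:
    """SHA-1 fingerprint of a data block (20 bytes)."""
    return hashlib.sha1(block).digest()

def is_sampled(fp: bytes) -> bool:
    """Sample a fingerprint when its top SAMPLE_BITS bits are zero,
    so about 1 in 2**SAMPLE_BITS fingerprints enter the RAM index."""
    return fp[0] >> (8 - SAMPLE_BITS) == 0

def build_sampled_index(blocks):
    """RAM-resident index holding only the sampled fingerprints."""
    index = {}
    for i, b in enumerate(blocks):
        fp = fingerprint(b)
        if is_sampled(fp):
            index[fp] = i  # block id of the first occurrence
    return index
```

Lowering SAMPLE_BITS keeps more fingerprints in RAM and catches more duplicates at the cost of memory; the rate actually used by I-sieve is not stated in the paper.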
Given the significant difference in performance between RAM and disk, fast storage devices
such as flash memory (typically packaged as a Solid State Drive (SSD)) can be a feasible solution in
deduplication systems. An SSD can play the role of a buffer cache, effectively narrowing the gap between memory
and disk. In ChunkStash[7], all metadata is stored on SSD in order to improve indexing performance;
this optimization shows only a marginal loss in deduplication quality. Compared to increasing memory
capacity, using flash memory SSDs is a more effective way to improve lookup performance.
This study describes the design, implementation, and
evaluation of our deduplication system (called I-sieve)
based on the open source code iSCSI. We also propose
a series of optimizations to improve inline performance
for primary workloads. Our evaluation shows data savings of up to 82.96% in office environments,
together with an improvement of about 91% in user response time for common I/O applications. In
summary, our key contributions primarily include the
following.
- Analysis of the features of current deduplication systems together with a detailed summary of their challenges.
- An inline deduplication system (I-sieve), for primary workloads, that serves as a module embedded in the cloud storage system.
- Design of novel index tables aimed at block level deduplication and a new multi-cache structure using SSD to boost foreground performance.
- Implementation of an I-sieve prototype in the Linux operating system based on an open source iSCSI target.
- Evaluation of our I-sieve performance by simulating real world applications such as office and VM environments.
The remainder of this paper is organized as follows.
In Section 2, we provide the background description
and motivation for this paper. In Section 3, we present
the design of the I-sieve, including the indexing table
and the multi-level cache. The evaluation method,
experimental settings, and workloads are presented in
Section 4 followed by numerical results and analysis in
Section 5. In Section 6, we give an overview of related
work, and in Section 7, we conclude the paper.
2 Background and Motivation
Before discussing our deduplication system, we first
introduce the traditional deduplication process in
current storage systems, and subsequently analyze its
strengths and weaknesses.
2.1 Traditional deduplication system
As mentioned above, data deduplication is one of the
popular data reduction technologies, which has been
applied in many backup storage systems[8–11]. In Fig. 1, we summarize the general deduplication process. There
Fig. 1 I/O request process in traditional deduplication systems.
are two major I/O operations for any storage system,
read and write. As shown in Fig. 1a, when the system
deals with write I/O requests from the frontend, data
deduplication operations happen on the write path of the
file system, and the following four steps are performed.
(1) File chunking. In this step, all files are
divided into different types of blocks by the chunking algorithms (e.g., fixed-sized blocks[15] and
variable-sized blocks[5, 24, 30]). The system then computes a fingerprint of each block using a hash function such as
a SHA-1 hash[31] or an MD5 hash[32]. Subsequently, all processed blocks with corresponding fingerprints are
handled by the next deduplication module.
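Step (1) can be sketched as follows; the 4 KB fixed block size matches the configuration used later in the paper, while the function names are our own illustration:

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-sized chunking with 4 KB blocks

def chunk_fixed(data: bytes, block_size: int = BLOCK_SIZE):
    """Split file content into fixed-sized blocks; the tail block may be short."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def fingerprints(blocks):
    """Compute a 20-byte SHA-1 fingerprint for every block."""
    return [hashlib.sha1(b).digest() for b in blocks]
```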
(2) Deduplication checking. In this stage, the
fingerprint of split blocks is checked against the
fingerprint table. If the block has a duplicate in the
table, then this block is not written to disk; instead only
the metadata is recorded (similar to the organization of
super blocks or block groups in the EXT file system).
However, if the block does not have a duplicate in the
table, then the fingerprint of this block is inserted into the fingerprint table as a unique block in the system. This
step involves an extra fingerprint indexing process to be
performed to check for duplicate data when compared
with the traditional I/O process.
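A minimal sketch of the duplicate check in step (2), using an in-memory dictionary as the fingerprint table (the real system uses the multi-level structures of Section 3):

```python
import hashlib

def dedup_write(blocks, fingerprint_table, store):
    """For each block, check the fingerprint table: duplicates are only
    referenced in the metadata, while unique blocks are inserted into the
    table and stored. Returns the file's ordered fingerprint list."""
    recipe = []
    for b in blocks:
        fp = hashlib.sha1(b).digest()
        if fp not in fingerprint_table:         # unique block: store it
            fingerprint_table[fp] = len(store)  # location of the stored block
            store.append(b)
        recipe.append(fp)                       # metadata references the block
    return recipe
```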
(3) Metadata updating. As the block indexing process
completes, the corresponding metadata of the file is
updated, including the mapping of blocks and files
based on pointers. Subsequently, the metadata is stored
in designated zones on the disk by periodic destage
operations.
(4) Data blocks storage. The final step in the write
path is to store the unique data blocks. The storage elements in most current storage systems are organized
as containers[7] or bins[10], which contain a specified
number of data blocks on the disk. When the container
or bin reaches the target size, it is sealed and written
from RAM memory to disk. In addition, some extra
block metadata is recorded to identify data blocks
belonging to a container or file.
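The container behavior of step (4) can be sketched as a toy class; the tiny capacity and class name are illustrative, since real containers hold many megabytes of blocks:

```python
CONTAINER_SIZE = 4  # blocks per container (illustrative; real systems use MBs)

class ContainerStore:
    """Buffer unique blocks in an open container; seal and destage the
    container (here simply appending it to a list) once it reaches its
    target size, mirroring the seal-and-write behavior described above."""
    def __init__(self, capacity=CONTAINER_SIZE):
        self.capacity = capacity
        self.open = []    # in-RAM container being filled
        self.sealed = []  # containers already written to disk

    def add(self, block):
        self.open.append(block)
        if len(self.open) >= self.capacity:
            self.seal()

    def seal(self):
        if self.open:
            self.sealed.append(self.open)
            self.open = []
```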
Figure 1b shows a data view of the traditional
deduplication system after storing two files, FileA and
FileB. As seen in the figure, blocks a and b are
the deduplicated data blocks in those two files. Only
a single copy is stored on the disk and appropriate
metadata is also created during this process. As seen
in Fig. 1c, the process of handling read I/O for FileA
is the inverse of the write I/O process. First, the
system reads the corresponding metadata information,
and subsequently obtains all blocks associated with
FileA. Next, the system reads the relevant blocks from
disk. Since the requested blocks may exist in various
containers or bins on the disk, this step can be a time
consuming process requiring more than one I/O request.
Finally, the data blocks are returned to the system.
Based on the processing flow, we can see that the traditional deduplication module is tightly bound to the
file system along the entire read and write path: embedded in a reformed file system such as liveDFS[33], or
in a specific file system like Venti[15] or HydraFS[34]. In other words, current deduplication systems are
co-evolved with their applications, and are not suitable for deployment as independent modules. In addition, the
complex metadata management model also restricts performance, especially for the mappings in
the deduplication operation work path, even though some optimizations[6, 7, 16, 22] improve deduplication
performance to some extent.
2.2 Classifying deduplication systems
Before designing a deduplication system, we first describe how deduplication systems are classified so that
the design can be better suited to the required applications. Currently,
there are many classifications for deduplication
systems.
Primary versus secondary deduplication systems.
This classification is based on the workloads served.
Primary deduplication systems are designed for
improving performance, in which workloads are
sensitive to I/O latency. Secondary deduplication
systems are primarily used for secondary storage
environments, such as backup applications, which
require high data throughput. As mentioned above, data
deduplication operations are time-consuming, which is
why primary deduplication systems are rarely used in
actual production environments. However, considering
the small-scale application of secondary storage, we
believe that current primary storage systems with
deduplication are a feasible scheme since significant
duplicate data exists in primary storage workloads as
well.
Post-processing versus inline deduplication. Post-
processing deduplication is an out-of-band approach
where data is not deduplicated until after the backup
has completed. In other words, with post-processing
deduplication, new data is first stored on the storage
device and then processed at a later time for duplication.
On the other hand, with inline deduplication, chunk
hash calculations are created on the target device as data
enters the storage devices in real-time. In this paper, we
primarily focus on the inline deduplication system for
its high space utilization and real-time characteristics.
We summarize the current deduplication studies, and
list the following observations that drive us to build a
new deduplication system.
- More duplicate data exists in many typical environments. From the conclusions in Ref. [29], approximately 50% duplicate data exists in primary storage systems and as much as 72% in backup datasets; all the analyzed data comes from different departments of Microsoft. Some cloud-based applications, such as live virtual machine environments, can achieve at least 40% data reduction[33]. The data storage characteristics of those scenarios make it possible to build a small-scale, high performance deduplication system (e.g., private-cloud storage).
- Complex data management in the work path. The operations in a traditional deduplication system consist of read, write, and delete operations, and every operation path updates much of the metadata, including the mapping table, block metadata, fingerprint table, and so on. All these work paths leave room for optimization to boost overall performance.
- Poor performance in inline deduplication systems. Most current deduplication systems are used in backup scenarios. However, these systems are not suitable for real-time applications due to their poor read or write performance.
Given this, we propose an inline, block level
primary deduplication system with high performance
called I-sieve. The goal of I-sieve is to build a
small-scale storage system for use in an office or
private-cloud environments. In order to address the
challenge of low performance in inline deduplication
systems, a lightweight indexing table and a two-level
cache structure are used to improve deduplication
performance in I-sieve. The evaluations show that I-sieve achieves an excellent tradeoff between foreground
performance and deduplication ratio.
3 Structure of I-sieve and the Indexing
Table Design
In this section, we first review the architecture of I-sieve. Next, we present an efficient fingerprint indexing
algorithm. Our I-sieve system always assumes that the
total size of the storage system is 10 TB; all designs
of the mapping table and cache are based on this
assumption.
3.1 Structure of I-sieve
As shown in Fig. 2, I-sieve is a backend storage module,
and all the foreground clients access the storage service
using the iSCSI protocol. Thus, I-sieve can be perceived
as a data sieve, which eliminates much of the data
redundancy seen in file systems. The implementation
of I-sieve is a module embedded into iSCSI; therefore,
it can be incorporated with many storage applications,
and also be easily deployed.
As seen in Fig. 2, I-sieve consists of three key functional components: a Deduplication engine, a Multi-cache, and a Snapshot module.
The Deduplication engine provides block-level
data deduplication, including hash table (also called
fingerprint table) management and an I/O remapping
module. All the foreground write requests are handled
within the deduplication engine, and then transferred to
disk request queues.
The Multi-cache module primarily handles cache
requests, and also acts as a bridge that connects
memory, SSD, and disk devices. The snapshot module
Fig. 2 Architectural overview of I-sieve.
is responsible for the reliability of the deduplicated
data blocks; it does so by performing periodic snapshot
operations.
From the view of each storage application, I-
sieve plays the role of a bridge between file system
and storage devices. All frontend I/O requests to
storage devices are redirected to the deduplication
engine first. As shown in Fig. 2, the deduplication
engine is below the file system layer; consequently,
the only work of this engine is to handle file I/O
requests. Unlike other deduplication approaches, I-
sieve has simpler data processing logic; thus, its
operation work path is relatively shorter, which makes
the deduplication process more efficient. This is
because the deduplicated data blocks only need some
remapping operations without needing more I/O request
information, which is the biggest difference between
block-level deduplication and other methods.
3.2 Design of the index tables
As described in Fig. 2, the index tables, including the
fingerprint table and other mapping tables, are required
in every deduplication operation. The efficient design
of the index table can result in great performance
improvements, especially for the inline primary
deduplication system.
There are two challenges to be addressed in the
design of the fingerprint index tables in I-sieve, low cost
memory consumption and efficient remapping logic.
Generally, the traditional fingerprint table uses the full
hash index for data deduplication, however, this can
result in memory exhaustion due to large data volumes.
For example, if the deduplication system has 10 TB capacity and uses a 4 KB block size for chunking, then
there are about 2.68 × 10^9 unique blocks in the system. Assuming that each hash entry in the index
table takes 30 bytes, the total memory consumption is about 75 GB just for the fingerprint table. Therefore,
it is not wise to hold the entire hash table in RAM.
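The arithmetic behind these figures, assuming 1 TB means 2^40 bytes and the 30-byte entry size stated above:

```python
TIB = 2 ** 40                          # 1 TB, binary interpretation
capacity = 10 * TIB                    # 10 TB system
block_size = 4 * 1024                  # 4 KB chunks
entry_size = 30                        # assumed bytes per hash entry

num_blocks = capacity // block_size    # about 2.68e9 unique blocks
table_bytes = num_blocks * entry_size  # full in-RAM fingerprint table
table_gib = table_bytes / 2 ** 30      # about 75 GB, as stated in the text
```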
On the other hand, there is also a need for the
remapping logic to redirect all frontend LBAs to their
real positions on the disks, since blocks with different
LBAs may have the same data content. In our index table design, we use the SHA-1 hash[31], for its
collision-resistant properties, to compute the fingerprint. This hash function outputs 160 bits, so each
fingerprint entry occupies 20 bytes.
As shown in Fig. 3, we resolve the above issues
with three key tables, LBA Remapping Table (LRT),
Multi-index Table (MT), and Hash node Table (HT). We
provide more details below.
LRT is responsible for remapping the frontend I/O
requests to the fingerprint index table. The first item
of each entry in LRT records the LBA from the top
of the file system, and the second item stores pointer
information of the fingerprint table.
Considering the size of the fingerprint table, we present a novel structure, called MT, to reduce the table
size. Instead of storing the full index table in RAM, MT is divided into many cells, called buckets, based
on the first three 8-bit segments of the full fingerprint of the block. Every bucket has 256 entries with the same hash
prefix. As MT grows, all identical-prefix buckets are organized with linked lists level by level. In order to
save space in the HT, we use the remaining 136 bits as the key in the HT at the last level. The design of MT
has a further motivation: it also facilitates caching buckets on high performance SSD.
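The prefix routing of a fingerprint through MT follows directly from the bit layout described above (three 8-bit levels, then a 136-bit HT key); the function name is our own:

```python
import hashlib

def mt_route(fingerprint: bytes):
    """Split a 160-bit SHA-1 fingerprint into three 8-bit bucket indices
    (the MT levels) and the 136-bit remainder used as the HT key."""
    assert len(fingerprint) == 20
    level1, level2, level3 = fingerprint[0], fingerprint[1], fingerprint[2]
    ht_key = fingerprint[3:]  # 17 bytes = 136 bits
    return (level1, level2, level3), ht_key
```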
3.3 Performance optimization using Bloom filter
As can be seen from Fig. 3, even though the multi-index
table can reduce the scale of the index table in memory,
www.redpel.com +917620593389
www.redpel.com +917620593389
22 Tsinghua Science and Technology, February 2015, 20(1): 17-27
Fig. 3 Index table structures and the mapping process in I-sieve.
there are still a large number of queries needed to check the deduplication fingerprints. This lookup process can be
time-consuming, even with an SSD. Thus, we use a high-efficiency tool, the bloom filter[35], to optimize the
indexing process. The bloom filter works as follows.
An empty bloom filter is a bit array of m bits, all set
to 0. There must also be k independent hash functions
defined, each of which maps or hashes one element to
a position in the m-bit array with a uniform random
distribution. When an element is added, it is fed to each
of the k hash functions to get k array positions. The bits at all these positions are then set to 1. In order to
query for an element (test whether it is in the array),
a check is performed to see if any of the bits at these
positions are 0, and if so, the element is not in the array.
If, however, all the bits are 1, then either the element is
in the array, or the bits are set to 1 during the insertion
of other elements, which results in a false positive. The
idea is to allocate a vector to test whether a data block
exists in the system; bloom filters provide a fast and
space-efficient probabilistic data structure used in many
deduplication systems[16, 36]. However, false positive
retrieval results are possible in bloom filters, which can
be represented by the following expression,
$$p_{\text{false positive}} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k},$$
where k denotes the number of the independent hashes,
m and n represent the table size (array size as mentioned
above) and the number of blocks in the storage system
respectively. In order to reduce the possibility of false
positives in the duplicate checking process, we select
10 hash functions in I-sieve, which results in a false
positive rate of only about 0.098%. Assume that the total number of blocks n is 2.7 × 10^9 for a 10 TB
storage system; then the memory consumption is about 4.5 GB, which is within the acceptable range.
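A compact bloom filter sketch together with the approximate false-positive formula above; deriving the k positions from salted SHA-1 digests is our illustrative choice, not necessarily how I-sieve implements it:

```python
import hashlib
from math import exp

class BloomFilter:
    """Bit-array bloom filter with k hash positions derived from salted SHA-1."""
    def __init__(self, m_bits: int, k: int):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item: bytes):
        for i in range(self.k):  # k pseudo-independent positions
            h = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def false_positive_rate(k: int, n: int, m: int) -> float:
    """Approximation (1 - e^{-kn/m})^k from the expression in the text."""
    return (1.0 - exp(-k * n / m)) ** k
```

A negative answer from `may_contain` is always correct; a positive answer may be a false positive at the rate the formula predicts.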
3.4 Multi-level cache of I-sieve
Besides the optimization of index performance, the
organization of the index tables and data blocks is
important for the overall system performance as well.
Therefore, we introduce a flash-based storage device (SSD) acting as a fast cache between RAM and disk.
Figure 4 shows the description of our multi-level
cache structure. I-sieve uses a Cache Mapping Table
(CMT) to manage different metadata and data caches in
RAM and SSD. For the metadata, I-sieve uses two-level
mappings, RAM-SSD, which means that the SSD stores
all the metadata. Some frequently used metadata
is migrated from SSD or disk to RAM in order to
boost performance. Specifically, the first three levels
of buckets are persisted in RAM, only taking up about
64 MB. Due to the scale of each HT shown in Fig. 4,
I-sieve selects some hot HTs based on request count to
store in RAM memory; the rest are stored on SSD. For
the cached data blocks, there are three tiers of mapping
in I-sieve: RAM, SSD, and disk. All deduplicated data blocks are eventually stored on disk; however, newly
added blocks are written to temporary SSD locations
Fig. 4 Structure of the multi-cache in I-sieve.
and then migrated to disk during designated time periods.
The system also caches hot spot blocks in RAM in order
to improve read performance; the selection of blocks is
based on the value of the reference count described in
HT.
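The hot-HT promotion policy can be sketched as a toy two-tier cache that promotes a hash table to RAM once its request count exceeds that of the coldest RAM-resident table; the eviction rule here is an assumption, since the paper only states that selection is based on request counts:

```python
class MultiCache:
    """Toy two-tier lookup: hot hash tables (by request count) live in RAM;
    the rest are served from the slower tier (the SSD in I-sieve)."""
    def __init__(self, ram_slots: int):
        self.ram_slots = ram_slots
        self.counts = {}  # request count per HT id
        self.ssd = {}     # all HTs persist on the slow tier
        self.ram = {}     # subset promoted to RAM

    def put(self, ht_id, table):
        self.ssd[ht_id] = table
        self.counts.setdefault(ht_id, 0)

    def get(self, ht_id):
        self.counts[ht_id] += 1
        if ht_id in self.ram:
            return self.ram[ht_id]   # RAM hit
        table = self.ssd[ht_id]      # slow-tier read
        self._maybe_promote(ht_id, table)
        return table

    def _maybe_promote(self, ht_id, table):
        if len(self.ram) < self.ram_slots:
            self.ram[ht_id] = table
            return
        coldest = min(self.ram, key=lambda h: self.counts[h])
        if self.counts[ht_id] > self.counts[coldest]:
            del self.ram[coldest]    # evict the coldest resident HT
            self.ram[ht_id] = table
```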
4 Evaluations
For our evaluations of I-sieve, we primarily use the
deduplication ratio and the user response time as the
key performance indicators. The experimental settings
for each test are the same, and each test is run three
times to eliminate any discrepancies.
4.1 Experimental settings
The testbed details of the server and client are listed in
Table 1, and they are connected via a gigabit switch.
In addition, a Microsoft iSCSI software initiator is
installed on the client and the corresponding iSCSI
target with the I-sieve module is deployed on the storage
server, on which a commercial cloud storage system is
deployed.
4.2 Implementation and workloads
We implement the I-sieve prototype based on the iSCSI code[37] with about 2000 lines of C code. The I-sieve
prototype is released separately as a module; thus, it can
be easily deployed in most application environments.
The workloads that drive our experiments have two
parts. One is a live Virtual Machine (VM) image, the
other is the IOmeter tool. To simulate a real work
environment, we setup two Windows systems in virtual
machines together with some frequently applications
(e.g., MS Office, application development environment,
etc.). The details of the experimental workloads are
listed in Table 2. Note that all VMs in our experiments
are set to 10 GB in size.
We also evaluate foreground performance using the
Table 1 Hardware details of the storage server and the
client. Note that all hardware on both the client and the
server are the same except for the OS and memory.
OS (server) Fedora 14 (kernel-2.6.35)
OS (client) Windows 7 Ultimate with SP1(X86)
Mainboard Supermicro X8DT6
Disks 2 Seagate ST31000524AS, 1 TB
SSD server 1 Intel SSD 520 Series, 128 GB
CPU Intel(R) Xeon(R) CPU 5650
NIC Intel 82574L
Memory (server) 32 GB DDR3
Memory (client) 8 GB DDR3
Table 2 Operating systems and typical applications used
in the evaluation. Note that the system setup images and
software are obtained from official websites or mirror sites.
All their install processes use the default settings in VMs.
Operating system   None   Office 2003   Office 2007   VS 2005   VS 2008
Windows XP SP2     SP2    SP2+O3        SP2+O7        SP2+V5    SP2+V8
Windows XP SP3     SP3    SP3+O3        SP3+O7        SP3+V5    SP3+V8
IOmeter tool. The request size generated by IOmeter is 4 KB because the average file size in the file
system is 4 KB[29]. Since some studies[29, 33] have confirmed that fixed-sized block chunking is suitable
for VM applications, we adopt the fixed-sized chunking algorithm in the deduplication engine of I-sieve.
Unless otherwise noted, the workloads used in
evaluations run completely on the client, and we
observe and analyze experimental results based on the
original record from the storage server.
5 Results Analysis
We first validate whether I-sieve can eliminate data
redundancy significantly in virtual machines. Figure 5a
shows the actual data capacity increasing based on the
Fig. 5 Performance of the data deduplication in I-sieve.
Note that all live virtual machine images are stored to the
storage pool in order.
deployment orders. I-sieve achieves an 82% space saving for the Windows SP2 virtual image. Excluding the
zero-filled blocks of the virtual machine image[38], I-sieve
still achieves a 23% deduplication ratio for the Windows
XP SP2 system. Similar results are also seen with other
virtual machine images. In addition, we observe some
interesting phenomena.
First, different applications within the same operating system show limited data duplication, as seen with SP2+O3
and SP2+V5. Second, most of the data redundancy exists between different service packs of the same application;
for example, deduplicating SP3+O3 achieves at least an 86.54% data reduction compared to
SP2+O3. Third, the variation of the deduplication ratio shown in Fig. 5b clearly reflects the version
iterations of the Windows systems and their applications, which also points toward sensible
VM deployments.
The second part of our evaluation is to test I-sieve
foreground performance to confirm the feasibility of
inline deduplication. We simulate common I/O features using IOmeter, and compare three different
models: the traditional iSCSI model (RAW for short), I-sieve deduplication with the cache module (DC), and
I-sieve deduplication with the cache and bloom filter modules (DCB).
As shown in Fig. 6, I-sieve outperforms RAW
under all six cases; especially for the random write
application, the response time of DC is only 0.7649 ms,
with an approximately 91% improvement over RAW.
This is because all I/O requests to I-sieve are cached by
our multi-cache module both for sequential and random
I/Os. RAW only optimizes sequential operations: small sequential I/Os are combined into a big sequential
data block that is handed to the underlying disks to improve performance.
From Fig. 6, one can see that I-sieve shields I/O
Fig. 6 Evaluations of the foreground performance in I-sieve.
All results are obtained using the IOmeter tool installed on
the client. The labels R, W, Seq, and Rnd in the legend
represent read operation, write operation, sequence, and
random I/O modes, respectively.
features of all six types of applications, and shows foreground performance that closely approximates
RAW. The reason is that I-sieve sits below the file system; consequently, it cannot distinguish
among I/O requests, such as read versus write requests, and all I/O requests go through the complete
deduplication process. As can also be seen from the figure, I-sieve with the bloom filter module (DCB)
performs slightly better than DC, and this result demonstrates the impact of the bloom filter on the
speed of block retrieval.
6 Related Works
Deduplication technology has been widely used in
most inline storage systems, including primary[6] and secondary storage[25]. There are also some
differences in deduplication granularity. With file-level deduplication[12], the deduplication ratio is less
pronounced[13, 29] than with block-level deduplication[4–7] or the sub-file level[8–11]. Therefore,
block chunking algorithms, both fixed-sized and variable-sized, are often used to realize block-level
deduplication. Some applications (e.g., office and virtual machine environments) are well suited for
fixed-sized deduplication[29, 33].
To improve foreground performance for inline deduplication, iDedup[6], a latency-aware inline data
deduplication system for primary storage, describes a novel mechanism to reduce latency. It takes advantage
of the spatial locality and temporal nature of primary workloads in order to avoid extra disk I/Os and seeks.
iDedup is complementary to I-sieve, and together they can further shorten response time.
Beyond workload-based optimization, ChunkStash[7] uses a
three-tier deduplication architecture that adds a fast flash
device (SSD). By serving index lookups that miss in RAM from a
fast, flash-based index, it reduces the lookup-miss penalty by
orders of magnitude. In the same spirit, we also employ an SSD as
a strategy for improving performance in I-sieve.
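The idea shared by ChunkStash and I-sieve, serving fingerprint lookups from faster tiers before touching disk, can be sketched as follows. Tier names, capacities, and the eviction policy are illustrative assumptions; real systems back the SSD and disk tiers with persistent structures:

```python
class TieredIndex:
    """Sketch of a multi-level fingerprint index: a small RAM cache
    in front of a larger SSD-resident table, with the full index on
    disk. Dicts stand in for the actual tier storage."""

    def __init__(self, ram_capacity=1024):
        self.ram = {}                  # hot fingerprint -> block address
        self.ram_capacity = ram_capacity
        self.ssd = {}                  # stands in for the SSD index
        self.disk = {}                 # stands in for the on-disk index

    def lookup(self, fp):
        # Probe tiers fastest-first; promote on a hit so repeated
        # lookups are served from RAM.
        for tier in (self.ram, self.ssd, self.disk):
            if fp in tier:
                addr = tier[fp]
                self._promote(fp, addr)
                return addr
        return None                    # definitely a new block

    def insert(self, fp, addr):
        self.disk[fp] = addr
        self._promote(fp, addr)

    def _promote(self, fp, addr):
        if len(self.ram) >= self.ram_capacity:
            self.ram.pop(next(iter(self.ram)))  # crude FIFO eviction
        self.ram[fp] = addr
        self.ssd[fp] = addr
```

A miss in RAM then costs one SSD read rather than a disk seek, which is where the orders-of-magnitude improvement comes from.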
In contrast to current optimizations for inline deduplication
systems, I-sieve improves performance through a shorter
deduplication operation path and a multi-level cache structure.
I-sieve serves as an I/O filter between file I/O and disk I/O;
there is no extra processing logic for data, such as the
“container” or “bin” structures used in other approaches.
Jibin Wang et al.: I-sieve: An Inline High Performance Deduplication System Used in Cloud Storage 25
7 Conclusions
In this paper, we propose I-sieve, a high performance inline
deduplication system for use in cloud storage. We design novel
index tables to fit the I-sieve architecture, since it bridges the
frontend and backend systems. We also implement a prototype of
I-sieve based on iSCSI and evaluate it with virtual machines and
the IOmeter tool. Our detailed test results demonstrate that
I-sieve achieves excellent foreground performance compared with
traditional iSCSI applications. In addition, I-sieve is also well
suited to deduplication in small storage environments, especially
with virtual machine applications. Finally, I-sieve can co-exist
with existing deduplication systems as long as they support the
iSCSI protocol.
Acknowledgements
We thank Wei Huang for programming the I-sieve prototype. We also
thank the anonymous reviewers for their valuable insights, which
have greatly improved the quality of the paper. This work was
supported by the Young Scholars of the Shandong Academy of Science
(No. 2014QN013) and the National High-Tech Research and
Development (863) Program of China (No. 2012AA011202).
References
[1] EMC Corporation, The EMC Digital Universe study, Technical
Report, 2014.
[2] International Data Corporation, The 2011 Digital Universe
study, Technical Report, 2011.
[3] D. R. Merrill, Storage economics: Four principles for
reducing total cost of ownership, http://www.hds.com/
assets/pdf/four-principles-for-reducing-total-cost-of-
ownership.pdf, 2011.
[4] A. T. Clements, I. Ahmad, M. Vilayannur, and J. Li,
Decentralized deduplication in SAN cluster file systems, in
Proceedings of the 2009 Conference on USENIX Annual
Technical Conference, 2009.
[5] E. Kruus, C. Ungureanu, and C. Dubnicki, Bimodal
content defined chunking for backup streams, in
Proceedings of the 8th USENIX Conference on File and
Storage Technologies, 2010.
[6] K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti,
iDedup: Latency-aware, inline data deduplication for
primary storage, in Proceedings of the 10th USENIX
Conference on File and Storage Technologies, 2012.
[7] B. Debnath, S. Sengupta, and J. Li, ChunkStash: Speeding
up inline storage deduplication using flash memory, in
Proceedings of the Annual Conference on USENIX Annual
Technical Conference, 2010.
[8] C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W.
Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and
M. Welnicki, HydraStor: A scalable secondary storage, in
Proceedings of the 7th Conference on File and Storage
Technologies, 2009, pp. 197–210.
[9] W. Dong, F. Douglis, K. Li, H. Patterson, S. Reddy,
and P. Shilane, Tradeoffs in scalable data routing for
deduplication clusters, in Proceedings of the 9th USENIX
Conference on File and Storage Technologies, 2011.
[10] D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge,
Extreme binning: Scalable, parallel deduplication
for chunk-based file backup, in Proceedings of the
IEEE International Symposium on Modeling, Analysis
Simulation of Computer and Telecommunication Systems,
2009, pp. 1–9.
[11] C. Ungureanu, B. Atkin, A. Aranya, S. Gokhale, S. Rago,
G. Calkowski, C. Dubnicki, and A. Bohra, HydraFS: A
high-throughput file system for the HydraStor content-
addressable storage system, in Proceedings of the 8th
USENIX Conference on File and Storage Technologies,
2010.
[12] W. J. Bolosky, S. Corbin, D. Goebel, and J. R. Douceur,
Single instance storage in windows 2000, in Proceedings
of the 4th Conference on USENIX Windows Systems
Symposium, 2000.
[13] C. Policroniades and I. Pratt, Alternatives for detecting
redundancy in storage systems data, in Proceedings of
the Annual Conference on USENIX Annual Technical
Conference, 2004.
[14] C. Liu, Y. Lu, C. Shi, G. Lu, D.-C. Du, and D.-S. Wang,
ADMAD: Application-driven metadata aware deduplication
archival storage system, in Proceedings of the Fifth IEEE
International Workshop on Storage Network Architecture
and Parallel I/Os, 2008, pp. 29–35.
[15] S. Quinlan and S. Dorward, Venti: A new approach to
archival data storage, in Proceedings of the 1st USENIX
Conference on File and Storage Technologies, 2002.
[16] B. Zhu, K. Li, and H. Patterson, Avoiding the disk
bottleneck in the data domain deduplication file system, in
Proceedings of the 6th USENIX Conference on File and
Storage Technologies, 2008, pp. 1–14.
[17] J. A. Paulo and J. Pereira, A survey and classification of
storage deduplication systems, ACM Computing Surveys,
vol. 47, no. 1, pp. 11:1–11:30, 2014.
[18] F. Liu, Y. Sun, B. Li, B. Li, and X. Zhang, FS2You: Peer-
assisted semipersistent online hosting at a large scale, IEEE
Transactions on Parallel and Distributed Systems, vol. 21,
no. 10, pp. 1442–1457, 2010.
[19] F. Liu, B. Li, B. Li, and H. Jin, Peer-assisted on-demand
streaming: Characterizing demands and optimizing
supplies, IEEE Transactions on Computers, vol. 62, no. 2,
pp. 351–361, 2013.
[20] F. Xu, F. Liu, L. Liu, H. Jin, B. Li, and B. Li, iAware:
Making live migration of virtual machines interference-
aware in the cloud, IEEE Transactions on Computers, vol.
63, no. 12, pp. 3012–3025, 2014.
[21] F. Xu, F. Liu, H. Jin, and A. Vasilakos, Managing
performance overhead of virtual machines in cloud
computing: A survey, state of the art, and future directions,
Proceedings of the IEEE, vol. 102, no. 1, pp. 11–31, 2014.
[22] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar,
G. Trezise, and P. Camble, Sparse indexing: Large
scale, inline deduplication using sampling and locality, in
Proceedings of the 7th Conference on File and Storage
Technologies, 2009, pp. 111–123.
[23] A. Muthitacharoen, B. Chen, and D. Mazières, A low-
bandwidth network file system, in Proceedings of the 18th
ACM Symposium on Operating Systems Principles, 2001,
pp. 174–187.
[24] D. R. Bobbarjung, S. Jagannathan, and C. Dubnicki,
Improving duplicate elimination in storage systems,
Transactions on Storage, vol. 2, no. 4, pp. 424–448, 2006.
[25] Data Domain, Data Domain appliance series: High-speed
inline deduplication storage, http://www.emc.com/backup-
and-recovery/data-domain/data-domain-deduplication-
storage-systems.htm, 2011.
[26] W. Leesakul, P. Townend, and J. Xu, Dynamic data
deduplication in cloud storage, in Proceedings of the 8th
IEEE International Symposium on Service Oriented System
Engineering, 2014, pp. 320–325.
[27] Y. Fu, H. Jiang, N. Xiao, L. Tian, F. Liu, and L.
Xu, Application-aware local-global source deduplication
for cloud backup services of personal storage, IEEE
Transactions on Parallel and Distributed Systems, vol. 25,
no. 5, pp. 1155–1165, 2014.
[28] P. Puzio, R. Molva, M. Onen, and S. Loureiro, Cloudedup:
Secure deduplication with encrypted data for cloud storage,
in Proceedings of the 5th IEEE International Conference
on Cloud Computing Technology and Science, 2013.
[29] D. T. Meyer and W. J. Bolosky, A study of practical
deduplication, in Proceedings of the 9th USENIX
Conference on File and Storage Technologies, 2011.
[30] K. Eshghi and H. K. Tang, A framework for analysing and
improving content-based chunking algorithms, Technical Report
HPL-2005-30(R1), HP Laboratories, 2005.
[31] National Institute of Standards and Technology, FIPS PUB
180-1: Secure Hash Standard, Technical Report, 1995.
[32] R. Rivest, The MD5 message-digest algorithm,
http://www.ietf.org/rfc/rfc1321.txt, 1992.
[33] C.-H. Ng, M. Ma, T.-Y. Wong, P. P. C. Lee, and J.
C. S. Lui, Live deduplication storage of virtual machine
images in an open-source cloud, in Proceedings of the 12th
International Middleware Conference, 2011, pp. 80–99.
[34] C. Ungureanu, B. Atkin, A. Aranya, S. Gokhale, S. Rago,
G. Calkowski, C. Dubnicki, and A. Bohra, HydraFS: A
high-throughput file system for the hydrastor content-
addressable storage system, in Proceedings of the 8th
USENIX Conference on File and Storage Technologies,
2010.
[35] B. H. Bloom, Space/time trade-offs in hash coding with
allowable errors, Communications of the ACM, vol. 13, no.
7, pp. 422–426, 1970.
[36] G. Lu, Y. J. Nam, and D.-C. Du, BloomStore: Bloom-
filter based memory-efficient key-value store for indexing
of data deduplication on flash, in Proceedings of the
28th IEEE Symposium on Mass Storage Systems and
Technologies, 2012.
[37] iSCSI enterprise target, http://sourceforge.net/projects/
iscsitarget/files/, 2010.
[38] K. Jin and E. L. Miller, The effectiveness of deduplication
on virtual machine disk images, in Proceedings of SYSTOR
2009: The Israeli Experimental Systems Conference, 2009.
Jibin Wang received the bachelor degree
in computer science from Chengdu
University of Information Technology,
China, in 2007, the MS and PhD degrees
in computer science from Huazhong
University of Science and Technology,
China, in 2010 and 2013, respectively.
He is currently working in Shandong
Computer Science Center (National Supercomputer Center in
Jinan). His research interests include computer architecture,
networked storage systems, parallel and distributed systems, and
cloud computing.
Zhigang Zhao received the BS and
MS degrees in management science and
engineering from Shandong Normal
University, China, in 2002 and 2005,
respectively. Presently, he is a research
assistant in the Shandong Computer
Science Center (National Supercomputer
Center in Jinan). His research interests
include cloud computing, service-oriented computing, and big
data.
Zhaogang Xu received the bachelor
degree in network engineering from
Shandong University of Science and
Technology, China, in 2010, and the MS
degree in computer architecture from
Shandong University of Science and
Technology, China, in 2013. Now, he
works in the Shandong Computer Science
Center (National Supercomputer Center in Jinan), Shandong
Provincial Key Laboratory of Computer Networks. His research
interests include cloud computing and big data storage.
Hu Zhang received the bachelor degree
in electronic information science and
technology from Liaocheng University
of Information Technology, China, in
2009, and the MS degree in signal and
information processing from Southwest
Jiaotong University, China, in 2013.
Presently, he is an engineer in the
Shandong Computer Science Center (National Supercomputer
Center in Jinan), Shandong Provincial Key Laboratory of
Computer Networks. His research interests include cloud
computing and big data mining.
Liang Li received the bachelor degree
in computer science from Shandong
University of Science and Technology,
China, in 2003, and the MS degree in
computer science from Shandong University,
China, in 2006. Presently, she is a research assistant
in the Shandong Computer Science Center
(National Supercomputer Center in Jinan),
Shandong Provincial Key Laboratory of Computer Networks.
Her research interests include service-oriented computing and
big data mining.
Ying Guo received the bachelor degree in
information management from Shandong
University of Finance and Economics,
China, in 2000, and the MS degree in enterprise
management from Dongbei University of
Finance and Economics, China, in 2004.
She is an associate research professor and
currently working for Shandong Computer
Science Center. Her research interests include cloud computing
and software engineering.
 
Frequency and similarity aware partitioning for cloud storage based on space ...
Frequency and similarity aware partitioning for cloud storage based on space ...Frequency and similarity aware partitioning for cloud storage based on space ...
Frequency and similarity aware partitioning for cloud storage based on space ...
 
Multiagent multiobjective interaction game system for service provisoning veh...
Multiagent multiobjective interaction game system for service provisoning veh...Multiagent multiobjective interaction game system for service provisoning veh...
Multiagent multiobjective interaction game system for service provisoning veh...
 
Efficient multicast delivery for data redundancy minimization over wireless d...
Efficient multicast delivery for data redundancy minimization over wireless d...Efficient multicast delivery for data redundancy minimization over wireless d...
Efficient multicast delivery for data redundancy minimization over wireless d...
 
Cloud assisted io t-based scada systems security- a review of the state of th...
Cloud assisted io t-based scada systems security- a review of the state of th...Cloud assisted io t-based scada systems security- a review of the state of th...
Cloud assisted io t-based scada systems security- a review of the state of th...
 
Bayes based arp attack detection algorithm for cloud centers
Bayes based arp attack detection algorithm for cloud centersBayes based arp attack detection algorithm for cloud centers
Bayes based arp attack detection algorithm for cloud centers
 
Architecture harmonization between cloud radio access network and fog network
Architecture harmonization between cloud radio access network and fog networkArchitecture harmonization between cloud radio access network and fog network
Architecture harmonization between cloud radio access network and fog network
 
Analysis of classical encryption techniques in cloud computing
Analysis of classical encryption techniques in cloud computingAnalysis of classical encryption techniques in cloud computing
Analysis of classical encryption techniques in cloud computing
 
An anomalous behavior detection model in cloud computing
An anomalous behavior detection model in cloud computingAn anomalous behavior detection model in cloud computing
An anomalous behavior detection model in cloud computing
 
A tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big dataA tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big data
 
A parallel patient treatment time prediction algorithm and its applications i...
A parallel patient treatment time prediction algorithm and its applications i...A parallel patient treatment time prediction algorithm and its applications i...
A parallel patient treatment time prediction algorithm and its applications i...
 
A mobile offloading game against smart attacks
A mobile offloading game against smart attacksA mobile offloading game against smart attacks
A mobile offloading game against smart attacks
 
A distributed video management cloud platform using hadoop
A distributed video management cloud platform using hadoopA distributed video management cloud platform using hadoop
A distributed video management cloud platform using hadoop
 

Kürzlich hochgeladen

The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 

Kürzlich hochgeladen (20)

The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 

I-Sieve: An inline High Performance Deduplication System Used in cloud storage

rapid increase[3].
[Author note] Jibin Wang, Zhigang Zhao, Zhaogang Xu, Hu Zhang, Liang Li, and Ying Guo are with the Shandong Computer Science Center (National Supercomputer Center in Jinan), Shandong Provincial Key Laboratory of Computer Networks, Jinan 250101, China. E-mail: {wangjb, zhaozhg, xuzhaogang, zhanghu, liliang, guoying}@sdas.org. To whom correspondence should be addressed. Manuscript received: 2014-11-15; revised: 2014-12-15; accepted: 2015-01-07.

One of the promising technologies for improving storage efficiency is deduplication, which is considered to be one of the best optimizations to reduce storage costs. This technology can effectively eliminate redundant data blocks by using pointers to reference an already stored identical data block in the storage system. Deduplication can work with many data types, including block[4–7] and sub-file[8–11] (also known as super-block or fixed-size container file), whole-file[12], and hybrid-file[13, 14].

Historically, data deduplication technology was only applied to backup systems[15–17]. This is because backup applications consist primarily of duplicate data, especially for cyclical full backup applications. Another reason is that most backup systems do not need high foreground performance; the main indicators used to evaluate backup system performance are data throughput and deduplication ratio. Similar to optimizations in cloud applications[18, 19] or cloud services[20, 21], there are many promising studies around performance optimizations resulting from deduplication. Prior work on deduplication has primarily focused on the optimization of current backup systems, including chunking algorithms[5, 22–24] and performance optimizations[7, 16]. Many of the
solutions have been applied in current commercial systems as secondary storage. With the widespread use of primary storage systems for storing large amounts of data, such as online video, user directories, and VM images, there has been more research[6, 7, 22, 25] focusing on inline deduplication for primary workloads rather than backup environments only, especially with cloud storage[26–28].

The challenge with inline deduplication is to find mechanisms that decrease the high latency introduced by deduplication operations, including hash table indexing and metadata management. Ideally, an in-memory hash table can decrease latency effectively; however, the reality is that only a small subset of the related tables can be resident in the limited memory due to the large scale of current deduplication systems, and as a result most of them are migrated to disk. Another popular research area is how to speed up the table lookup process and reduce the number of chunk hashes in RAM. There are three types of optimizations to reduce the size of the hash table in RAM: reducing the hash length, using a big chunk size, and using a sampling hash. It is simple to compute a hash using a shorter hash function (e.g., a 32-bit truncated MD5 or a 40-bit truncated SHA-1); however, challenges arise as a result. Specifically, as the number of chunks increases, more hash collisions can occur between different chunk blocks, which can lead to data inconsistency in the storage system. A large average chunk size in the chunking algorithm can be used as an efficient way to cut down the total number of chunks in RAM. However, this strategy can reduce the deduplication ratio, since the average chunk size in current file systems is still small[29]. Separate from the above two methods, the sampling hash is widely used in many deduplication systems, especially for large-scale storage systems.
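The collision risk of a shortened fingerprint can be quantified with the standard birthday bound. The sketch below is illustrative only (not part of the paper); the chunk count of one billion is an arbitrary assumption:

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday bound: P(collision) ~ 1 - exp(-n(n-1) / 2^(b+1))."""
    # expm1 keeps precision when the exponent is tiny (large hash_bits).
    return -math.expm1(-n_items * (n_items - 1) / 2.0 / 2.0 ** hash_bits)

# One billion chunks: a 32-bit fingerprint collides almost surely,
# while a full 160-bit SHA-1 fingerprint essentially never does.
n = 10 ** 9
print(f"32-bit:  {collision_probability(n, 32):.4f}")
print(f"160-bit: {collision_probability(n, 160):.2e}")
```

This is why simply truncating the hash trades correctness for memory, while sampling keeps full-length hashes and only reduces how many reside in RAM.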
It is also relatively suitable for combining with hash and cache structures to improve index performance; therefore, we also utilize the sampling hash in our I-sieve design.

Given the significant difference in performance between RAM and disk, fast storage devices such as flash memory (typically known as a Solid State Drive, SSD) can be a feasible component of deduplication systems. Such a device can play the role of a buffer cache, effectively narrowing the gap between memory and disk. In ChunkStash[7], all metadata is stored in SSD in order to improve indexing performance; this optimization shows only a marginal loss in deduplication quality. Compared to increasing memory capacity, using a flash memory SSD is a more effective way to improve lookup performance.

This study describes the design, implementation, and evaluation of our deduplication system (called I-sieve) based on the open source iSCSI code. We also propose a series of optimizations to improve inline performance for primary workloads. Our evaluation shows that up to 82.96% data savings exist in office environments, together with an improvement of about 91% in user response time for common I/O applications. In summary, our key contributions primarily include the following:
- Analysis of the features of current deduplication systems together with a detailed summary of their challenges.
- An inline deduplication system (I-sieve) for primary workloads that serves as a module embedded in the cloud storage system.
- Design of novel index tables aimed at block-level deduplication and a new multi-cache structure using SSD to boost foreground performance.
- Implementation of an I-sieve prototype in the Linux operating system based on an open source iSCSI target.
- Evaluation of I-sieve performance by simulating real-world applications such as office and VM environments.

The remainder of this paper is organized as follows. In Section 2, we provide the background description and motivation for this paper.
In Section 3, we present the design of I-sieve, including the indexing table and the multi-level cache. The evaluation method, experimental settings, and workloads are presented in Section 4, followed by numerical results and analysis in Section 5. In Section 6, we give an overview of related work, and in Section 7, we conclude the paper.

2 Background and Motivation

Before discussing our deduplication system, we first introduce the traditional deduplication process in current storage systems, and subsequently analyze its strengths and weaknesses.

2.1 Traditional deduplication system

As mentioned above, data deduplication is one of the popular data reduction technologies and has been applied in many backup storage systems[8–11]. In Fig. 1, we summarize the general deduplication process. There
are two major I/O operations for any storage system: read and write. As shown in Fig. 1a, when the system deals with write I/O requests from the frontend, data deduplication operations happen on the write path of the file system, and the following four steps are performed.

[Fig. 1 I/O request process in traditional deduplication systems.]

(1) File chunking. In this step, all files are divided into different types of blocks by the chunking algorithms (e.g., fixed-sized blocks[15] and variable-sized blocks[5, 24, 30]). The system then computes a fingerprint of each block using a hash function such as SHA-1[31] or MD5[32]. Subsequently, all processed blocks with corresponding fingerprints are handled by the next deduplication module.

(2) Deduplication checking. In this stage, the fingerprint of each split block is checked against the fingerprint table. If the block has a duplicate in the table, the block is not written to disk; instead, only the metadata is recorded (similar to the organization of super blocks or block groups in the EXT file system). If the block does not have a duplicate in the table, its fingerprint is inserted into the fingerprint table as a unique block in the system. Compared with the traditional I/O process, this step involves an extra fingerprint indexing pass to check for duplicate data.

(3) Metadata updating. As the block indexing process completes, the corresponding metadata of the file is updated, including the pointer-based mapping between blocks and files. Subsequently, the metadata is stored in designated zones on the disk by periodic destage operations.

(4) Data block storage. The final step in the write path is to store the unique data blocks.
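As a concrete illustration of steps (1)–(4), here is a minimal sketch using fixed-size 4 KB chunking and SHA-1 fingerprints; the fingerprint table and block store are plain dictionaries standing in for the on-disk structures, and the names are our own, not from the paper:

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size chunking, one of the strategies cited above

def dedup_write(data, fingerprint_table, store):
    """Chunk, fingerprint, check for duplicates, and store unique blocks.
    Returns the file's metadata: the ordered list of block fingerprints."""
    recipe = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        fp = hashlib.sha1(block).hexdigest()   # step (1): chunk + fingerprint
        if fp not in fingerprint_table:        # step (2): deduplication check
            fingerprint_table[fp] = True
            store[fp] = block                  # step (4): store unique block
        recipe.append(fp)                      # step (3): metadata update
    return recipe

table, store = {}, {}
dedup_write(b"a" * 4096 + b"b" * 4096, table, store)  # FileA: blocks a, b
dedup_write(b"b" * 4096 + b"c" * 4096, table, store)  # FileB: blocks b, c
print(len(store))  # 3 unique blocks stored for 4 logical blocks
```

The shared block b is stored only once; its second occurrence costs only a metadata entry.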
The storage elements in most current storage systems are organized into containers[7] or bins[10], which hold a specified number of data blocks on the disk. When a container or bin reaches the target size, it is sealed and written from RAM to disk. In addition, some extra block metadata is recorded to identify the data blocks belonging to a container or file.

Figure 1b shows a data view of the traditional deduplication system after storing two files, FileA and FileB. As seen in the figure, blocks a and b are the deduplicated data blocks in those two files; only a single copy is stored on the disk, and the appropriate metadata is created during this process. As seen in Fig. 1c, the process of handling read I/O for FileA is the inverse of the write I/O process. First, the system reads the corresponding metadata information and obtains the list of all blocks associated with FileA. Next, the system reads the relevant blocks from disk. Since the requested blocks may reside in various containers or bins on the disk, this step can be time consuming and require more than one I/O request. Finally, the data blocks are returned to the system.

Based on this processing flow, we can see that the traditional deduplication module is tightly bound to the file system along the entire read and write path, either embedded in a reformed file system such as liveDFS[33] or built into a specific file system like Venti[15] or HydraFS[34]. In other words, current deduplication systems are co-evolved with their applications and are not suitable for deployment as an independent module. In addition, the complex metadata management model also restricts performance, especially for the mappings on the deduplication work path, even though some optimizations[6, 7, 16, 22] improve deduplication performance to some extent.
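The read path can be sketched the same way. In this illustrative sketch (our own simplification, not the paper's implementation), the per-file metadata is an ordered fingerprint list and a dictionary stands in for the on-disk containers:

```python
import hashlib

def fingerprint(block):
    return hashlib.sha1(block).hexdigest()

def dedup_read(recipe, store):
    """Reassemble a file from its ordered fingerprint list. In a real
    system each lookup may hit a different container on disk, which is
    why a single file read can cost several backend I/O requests."""
    return b"".join(store[fp] for fp in recipe)

# FileA consists of blocks a and b; block b may be shared with other
# files, but only one physical copy exists in the store.
store = {fingerprint(b): b for b in (b"a" * 4096, b"b" * 4096)}
file_a = dedup_read([fingerprint(b"a" * 4096), fingerprint(b"b" * 4096)], store)
print(len(file_a))  # 8192
```

Note that deduplication turns a sequential file read into scattered lookups, which is one source of the read-latency overhead discussed above.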
2.2 Classifying deduplication systems

Before designing a deduplication system, we first describe the classification of deduplication systems so that ours can be better matched to the required applications. Currently, there are many classifications for deduplication
systems.

Primary versus secondary deduplication systems. This classification is based on the workloads served. Primary deduplication systems are designed for improving performance, where workloads are sensitive to I/O latency. Secondary deduplication systems are primarily used in secondary storage environments, such as backup applications, which require high data throughput. As mentioned above, data deduplication operations are time-consuming, which is why primary deduplication systems are rarely used in actual production environments. However, considering the small-scale application of secondary storage, we believe that primary storage systems with deduplication are a feasible scheme, since significant duplicate data exists in primary storage workloads as well.

Post-processing versus inline deduplication. Post-processing deduplication is an out-of-band approach in which data is not deduplicated until after the backup has completed. In other words, with post-processing deduplication, new data is first stored on the storage device and then processed at a later time for duplication. With inline deduplication, by contrast, chunk hash calculations are performed on the target device in real time as data enters the storage devices. In this paper, we primarily focus on inline deduplication for its high space utilization and real-time characteristics.

We summarize the current deduplication studies and list the following observations that drive us to build a new deduplication system.

More duplicate data exists in many typical environments. According to Ref. [29], approximately 50% duplicate data exists in primary storage systems and as much as 72% in backup datasets; all the analyzed data comes from different departments of Microsoft. Some cloud-based applications, such as live virtual machine environments, can achieve at least 40% data reduction[33].
The characteristics of data storage in those scenarios make it possible for us to build a small-scale, high performance deduplication system (e.g., private-cloud storage).

Complex data management in the work path. A traditional deduplication system handles read, write, and delete operations, and every operation path updates a large amount of information, including the mapping table, block metadata, fingerprint table, and so on. All these work paths leave room for optimization to boost overall performance.

Poor performance in inline deduplication systems. Most current deduplication systems are used in backup scenarios. For real-time applications, however, these systems are not suitable due to their poor read and write performance.

Given this, we propose an inline, block-level primary deduplication system with high performance, called I-sieve. The goal of I-sieve is to build a small-scale storage system for use in office or private-cloud environments. In order to address the challenge of low performance in inline deduplication systems, a lightweight indexing table and a two-level cache structure are used to improve deduplication performance in I-sieve. The evaluations show that I-sieve achieves an excellent tradeoff between foreground performance and deduplication ratio.

3 Structure of I-sieve and the Indexing Table Design

In this section, we first review the architecture of I-sieve. Next, we present an efficient fingerprint indexing algorithm. Our I-sieve design assumes throughout that the total size of the storage system is 10 TB; all designs of the mapping table and cache are based on this assumption.

3.1 Structure of I-sieve

As shown in Fig. 2, I-sieve is a backend storage module, and all foreground clients access the storage service using the iSCSI protocol. Thus, I-sieve can be perceived as a data sieve, which eliminates much of the data redundancy seen in file systems.
The implementation of I-sieve is a module embedded into iSCSI; therefore, it can be incorporated with many storage applications and easily deployed. As seen in Fig. 2, I-sieve consists of three key functional components: the deduplication engine, the multi-cache, and a snapshot module. The deduplication engine provides block-level data deduplication, including hash table (also called fingerprint table) management and an I/O remapping module. All foreground write requests are handled within the deduplication engine and then transferred to the disk request queues. The multi-cache module primarily handles cache requests and also acts as a bridge connecting memory, SSD, and disk devices. The snapshot module
is responsible for the reliability of the deduplicated data blocks; it does so by performing periodic snapshot operations.

Fig. 2 Architectural overview of I-sieve.

From the view of each storage application, I-sieve plays the role of a bridge between the file system and storage devices. All frontend I/O requests to storage devices are redirected to the deduplication engine first. As shown in Fig. 2, the deduplication engine is below the file system layer; consequently, the only work of this engine is to handle file I/O requests. Unlike other deduplication approaches, I-sieve has simpler data processing logic; thus, its operation work path is relatively shorter, which makes the deduplication process more efficient. This is because the deduplicated data blocks only need some remapping operations without needing more I/O request information, which is the biggest difference between block-level deduplication and other methods.

3.2 Design of the index tables

As described in Fig. 2, the index tables, including the fingerprint table and other mapping tables, are required in every deduplication operation. An efficient design of the index table can yield great performance improvements, especially for an inline primary deduplication system. There are two challenges to be addressed in the design of the fingerprint index tables in I-sieve: low memory consumption and efficient remapping logic. Generally, the traditional fingerprint table uses a full hash index for data deduplication; however, this can result in memory exhaustion due to large data volumes. For example, if the deduplication system has a 10 TB capacity and uses a 4 KB block size for chunking, then there are about 2.68 × 10^9 unique blocks within the system. Assuming that each hash entry in the index table takes 30 bytes, the total memory consumption is about 75 GB just for the fingerprint table.
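This estimate can be reproduced with straightforward arithmetic (the 30-byte entry size is the assumption stated in the text):

```python
# Reproduce the fingerprint-table sizing estimate for a 10 TB system.
capacity = 10 * 2**40       # total system capacity: 10 TB
block_size = 4 * 2**10      # fixed chunk size: 4 KB
entry_size = 30             # assumed bytes per hash-table entry

unique_blocks = capacity // block_size
table_size_gb = unique_blocks * entry_size / 2**30

print(unique_blocks)        # 2684354560, i.e., about 2.68e9 blocks
print(table_size_gb)        # 75.0 GB -- far too large to hold in RAM
```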
Therefore, it is not a wise decision to hold the entire hash table in RAM. On the other hand, there is also a need for remapping logic to redirect all frontend LBAs to their real positions on the disks, since blocks with different LBAs may have the same data content. In our index table design, we use the SHA-1 hash[31], for its collision-resistant properties, to compute the fingerprint. This hash function has a 160-bit output, so each fingerprint entry occupies 20 bytes in total. As shown in Fig. 3, we resolve the above issues with three key tables: the LBA Remapping Table (LRT), the Multi-index Table (MT), and the Hash node Table (HT). We provide more details below.

The LRT is responsible for remapping the frontend I/O requests to the fingerprint index table. The first item of each entry in the LRT records the LBA from the top of the file system, and the second item stores pointer information for the fingerprint table. Considering the size of the fingerprint table, we present a novel structure, called the MT, to reduce the table size. Instead of storing the full index table in RAM, the MT is divided into many cells, called buckets, based on the first three 8-bit segments of the block's full fingerprint. Every bucket has 256 entries with the same hash prefix. As the MT grows, all identical-prefix buckets are organized with linked lists level by level. In order to save space in the HT, we use the remaining 136 bits as the key in the HT at the last level. The design of the MT has a further motivation: it also facilitates caching buckets on a high performance SSD.

3.3 Performance optimization using Bloom filter

As can be seen from Fig. 3, even though the multi-index table can reduce the scale of the index table in memory,
there still are a large number of queries needed to check the deduplication fingerprint. This lookup process can be time-consuming, even with an SSD. Thus, we use a high-efficiency tool, the bloom filter[35], to optimize the indexing process.

Fig. 3 Index table structures and the mapping process in I-sieve.

The bloom filter works as follows. An empty bloom filter is a bit array of m bits, all set to 0. There must also be k independent hash functions defined, each of which maps or hashes one element to a position in the m-bit array with a uniform random distribution. When an element is added, it is fed to each of the k hash functions to get k array positions, and the bits at all these positions are set to 1. To query for an element (test whether it is in the array), a check is performed to see if any of the bits at these positions are 0; if so, the element is not in the array. If, however, all the bits are 1, then either the element is in the array, or the bits were set to 1 during the insertion of other elements, which results in a false positive. The idea is to allocate a vector to test whether a data block exists in the system; bloom filters provide a fast and space-efficient probabilistic data structure used in many deduplication systems[16, 36]. However, false positive retrieval results are possible in bloom filters; the false positive rate can be represented by the following expression:

p_false positive = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k,

where k denotes the number of independent hashes, and m and n represent the table size (the array size mentioned above) and the number of blocks in the storage system, respectively. In order to reduce the possibility of false positives in the duplicate checking process, we select 10 hash functions in I-sieve, which results in a false positive percentage of only about 0.098%.
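The mechanics just described can be captured in a few lines. The sketch below is our illustration (salted SHA-1 stands in for the k independent hash functions, which is not necessarily how I-sieve derives them); it also evaluates the approximate false-positive expression for k = 10 and roughly 14.3 filter bits per block, the parameters implied by the text:

```python
import hashlib
import math

class BloomFilter:
    """Minimal bloom filter as described above: an m-bit array with
    k hash positions per element (derived here from salted SHA-1)."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: bytes):
        for salt in range(self.k):  # k "independent" hash functions
            digest = hashlib.sha1(bytes([salt]) + item).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

bf = BloomFilter(m=10_000, k=10)
bf.add(b"block fingerprint A")
print(bf.may_contain(b"block fingerprint A"))  # True (no false negatives)
print(bf.may_contain(b"block fingerprint B"))  # False with overwhelming probability

# Evaluate the approximate false-positive expression for k = 10 hash
# functions and a 4.5 GB bit array covering n = 2.7e9 blocks.
k, n = 10, 2.7e9
m = 4.5 * 2**30 * 8
p = (1 - math.exp(-k * n / m)) ** k
print(f"{p:.2%}")  # about 0.10%, in line with the ~0.098% reported above
```

A negative answer from the filter rules out a block without touching the index at all; only positive answers (true or false) fall through to the MT/HT lookup.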
Assume that the total number of blocks n is 2.7 × 10^9 for a 10 TB storage system; then the memory consumption is about 4.5 GB, which is within the acceptable range.

3.4 Multi-level cache of I-sieve

Besides the optimization of index performance, the organization of the index tables and data blocks is also important for overall system performance. Therefore, we introduce a flash-based storage device, an SSD, acting as a fast cache between RAM and disk. Figure 4 shows the structure of our multi-level cache. I-sieve uses a Cache Mapping Table (CMT) to manage the different metadata and data caches in RAM and on SSD. For the metadata, I-sieve uses a two-level mapping, RAM-SSD, which means that the SSD stores all the metadata. Some frequently used metadata is migrated from SSD or disk to RAM in order to boost performance. Specifically, the first three levels of buckets are persisted in RAM, taking up only about 64 MB. Due to the scale of each HT shown in Fig. 4, I-sieve selects some hot HTs, based on request count, to store in RAM; the rest are stored on SSD. For the cached data blocks, there are three tiers of mapping in I-sieve: RAM, SSD, and disk. All deduplicated data blocks are eventually stored on disks; however, newly added blocks are written to temporary SSD locations

Fig. 4 Structure of the multi-cache in I-sieve.
and then imported to the disks during particular time windows. The system also caches hot-spot blocks in RAM in order to improve read performance; the selection of blocks is based on the value of the reference count recorded in the HT.

4 Evaluations

For our evaluations of I-sieve, we primarily use the deduplication ratio and the user response time as the key performance indicators. The experimental settings for each test are the same, and each test is run three times to eliminate any discrepancies.

4.1 Experimental settings

The testbed details of the server and client are listed in Table 1; they are connected via a gigabit switch. In addition, a Microsoft iSCSI software initiator is installed on the client, and the corresponding iSCSI target with the I-sieve module is deployed on the storage server, on which a commercial cloud storage system is deployed.

4.2 Implementation and workloads

We implement the I-sieve prototype based on the iSCSI target code[37] with about 2000 lines of C code. The I-sieve prototype is released separately as a module; thus, it can be easily deployed in most application environments. The workloads that drive our experiments have two parts: one is a live Virtual Machine (VM) image, and the other is the IOmeter tool. To simulate a real work environment, we set up two Windows systems in virtual machines together with some frequently used applications (e.g., MS Office, application development environments, etc.). The details of the experimental workloads are listed in Table 2. Note that all VMs in our experiments are set to 10 GB in size. We also evaluate foreground performance using the

Table 1 Hardware details of the storage server and the client. Note that all hardware on both the client and the server are the same except for the OS and memory.
OS (server):      Fedora 14 (kernel 2.6.35)
OS (client):      Windows 7 Ultimate with SP1 (x86)
Mainboard:        Supermicro X8DT6
Disks:            2 × Seagate ST31000524AS, 1 TB
SSD (server):     1 × Intel SSD 520 Series, 128 GB
CPU:              Intel(R) Xeon(R) CPU 5650
NIC:              Intel 82574L
Memory (server):  32 GB DDR3
Memory (client):  8 GB DDR3

Table 2 Operating systems and typical applications used in the evaluation. Note that the system setup images and software are obtained from official websites or mirror sites. All their install processes use the default settings in VMs.

Operating system   VM descriptions and abbreviations
                   None   Office 2003   Office 2007   VS 2005   VS 2008
Windows XP SP2     SP2    SP2+O3        SP2+O7        SP2+V5    SP2+V8
Windows XP SP3     SP3    SP3+O3        SP3+O7        SP3+V5    SP3+V8

IOmeter tool. The request size generated by IOmeter is 4 KB because the average file size in the file system is 4 KB[29]. Since some studies[29, 33] have confirmed that fixed-sized block chunking is suitable for VM applications, we adopt the fixed-sized chunking algorithm in the deduplication engine of I-sieve. Unless otherwise noted, the workloads used in the evaluations run completely on the client, and we observe and analyze the experimental results based on the original records from the storage server.

5 Results Analysis

We first validate whether I-sieve can eliminate data redundancy significantly in virtual machines. Figure 5a shows the actual data capacity increasing based on the

Fig. 5 Performance of the data deduplication in I-sieve. Note that all live virtual machine images are stored to the storage pool in order.
deployment orders. I-sieve achieves an 82% space saving for the Windows SP2 virtual image. Even excluding the zero-filled blocks of the virtual machine image[38], I-sieve still achieves a 23% deduplication ratio for the Windows XP SP2 system. Similar results are also seen with the other virtual machine images. In addition, we observe some interesting phenomena. First, different applications within the same operating system have limited data duplication, such as SP2+O3 and SP2+V5. Second, most of the data redundancy exists between different service packs of the same application. As an example, the deduplication result of SP3+O3 achieves at least 86.54% data reduction compared to SP2+O3. Third, we can clearly see the changes across Windows systems and their application iteration versions from the variation of the deduplication ratio shown in Fig. 5b, which also points us toward sensible VM deployment choices.

The second part of our evaluation tests I-sieve's foreground performance to confirm the feasibility of inline deduplication. We simulate common I/O features using IOmeter, and compare three different models: the traditional iSCSI model (RAW for short), I-sieve deduplication with the cache module (DC), and I-sieve deduplication with the cache and bloom filter modules (DCB). As shown in Fig. 6, I-sieve outperforms RAW in all six cases; especially for the random write application, the response time of DC is only 0.7649 ms, an approximately 91% improvement over RAW. This is because all I/O requests to I-sieve are cached by our multi-cache module, both for sequential and random I/Os. RAW only provides optimizations for sequential operations; small sequential I/Os are combined into a big sequential data block handed to the underlying disks to improve performance. From Fig. 6, one can see that I-sieve shields the I/O

Fig. 6 Evaluations of the foreground performance in I-sieve.
All results are obtained using the IOmeter tool installed on the client. The labels R, W, Seq, and Rnd in the legend represent read operations, write operations, and sequential and random I/O modes, respectively.

features of the six types of applications, and shows a good approximation of foreground performance compared to RAW. The reason is that the structure of I-sieve sits below the file system; consequently, it does not know the differences among the I/O requests, such as whether a request is a read or a write, and all I/O requests go through the complete deduplication process. As can also be seen from the figure, I-sieve with the bloom filter module (DCB) shows a slightly better performance improvement compared with DC, and this result demonstrates the impact of the bloom filter on the speed of block retrieval.

6 Related Works

Deduplication technology has been widely used in most inline storage systems, including primary[6] and secondary storage[25]. There are also differences in deduplication granularity. With file-level deduplication[12], the deduplication ratio is not significant[13, 29] when compared with block-level deduplication[4–7] or sub-file-level deduplication[8–11]. Therefore, block chunking algorithms are often used to realize block-level deduplication, including fixed-sized and variable-sized algorithms. Some applications (e.g., office, virtual machine environments, etc.) are well suited to fixed-sized deduplication[29, 33]. To improve foreground performance for inline deduplication, iDedup[6], a latency-aware inline data deduplication system for primary storage, describes a novel mechanism to reduce latency. It takes advantage of the spatial locality and temporal nature of primary workloads in order to avoid extra disk I/Os and seeks. iDedup is complementary to I-sieve, and together they can further shorten response time. Besides optimization based on workloads, ChunkStash[7] uses a three-tier deduplication architecture by adding a fast device, an SSD.
By utilizing the SSD, the penalty of index lookup misses in RAM is reduced by orders of magnitude, because such lookups are served from a fast, flash-based index. In the same spirit, we also employ an SSD as a strategy for improving performance in I-sieve. In contrast to current optimizations for inline deduplication systems, I-sieve improves performance through a shorter deduplication operation path and a multi-cache structure. I-sieve can serve as an I/O filter between file I/O and disk I/O; there is no extra processing logic for data, such as the "container" or "bin" structures in other approaches.
7 Conclusions

In this paper, we propose I-sieve, a high performance inline deduplication system for use in cloud storage. We design novel index tables to suit the I-sieve architecture, since it is a bridge between frontend and backend systems. We also implement a prototype of I-sieve based on iSCSI, and evaluate it with virtual machines and the IOmeter tool. We present our detailed test results, and demonstrate that I-sieve has excellent foreground performance compared with traditional iSCSI applications. In addition, I-sieve is also suitable for deduplication in small storage environments, especially with virtual machine applications. Finally, I-sieve can co-exist with existing deduplication systems as long as they support the iSCSI protocol.

Acknowledgements

We thank Wei Huang for programming the I-sieve prototype. We would also like to thank the anonymous reviewers for their valuable insights, which have greatly improved the quality of the paper. This work was supported by the Young Scholars of the Shandong Academy of Science (No. 2014QN013) and the National High-Tech Research and Development (863) Program of China (No. 2012AA011202).

References

[1] EMC Corporation, The EMC digital universe study, Technical Report, 2014.
[2] International Data Corporation, The 2011 digital universe study, Technical Report, 2011.
[3] D. R. Merrill, Storage economics: Four principles for reducing total cost of ownership, http://www.hds.com/assets/pdf/four-principles-for-reducing-total-cost-of-ownership.pdf, 2011.
[4] A. T. Clements, I. Ahmad, M. Vilayannur, and J. Li, Decentralized deduplication in SAN cluster file systems, in Proceedings of the 2009 Conference on USENIX Annual Technical Conference, 2009.
[5] E. Kruus, C. Ungureanu, and C.
Dubnicki, Bimodal content defined chunking for backup streams, in Proceedings of the 8th USENIX Conference on File and Storage Technologies, 2010.
[6] K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti, iDedup: Latency-aware, inline data deduplication for primary storage, in Proceedings of the 10th USENIX Conference on File and Storage Technologies, 2012.
[7] B. Debnath, S. Sengupta, and J. Li, ChunkStash: Speeding up inline storage deduplication using flash memory, in Proceedings of the Annual Conference on USENIX Annual Technical Conference, 2010.
[8] C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki, HYDRAstor: A scalable secondary storage, in Proceedings of the 7th Conference on File and Storage Technologies, 2009, pp. 197–210.
[9] W. Dong, F. Douglis, K. Li, H. Patterson, S. Reddy, and P. Shilane, Tradeoffs in scalable data routing for deduplication clusters, in Proceedings of the 9th USENIX Conference on File and Storage Technologies, 2011.
[10] D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge, Extreme binning: Scalable, parallel deduplication for chunk-based file backup, in Proceedings of the IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2009, pp. 1–9.
[11] C. Ungureanu, B. Atkin, A. Aranya, S. Gokhale, S. Rago, G. Calkowski, C. Dubnicki, and A. Bohra, HydraFS: A high-throughput file system for the HYDRAstor content-addressable storage system, in Proceedings of the 8th USENIX Conference on File and Storage Technologies, 2010.
[12] W. J. Bolosky, S. Corbin, D. Goebel, and J. R. Douceur, Single instance storage in Windows 2000, in Proceedings of the 4th Conference on USENIX Windows Systems Symposium, 2000.
[13] C. Policroniades and I. Pratt, Alternatives for detecting redundancy in storage systems data, in Proceedings of the Annual Conference on USENIX Annual Technical Conference, 2004.
[14] C. Liu, Y. Lu, C. Shi, G.
Lu, D.-C. Du, and D.-S. Wang, ADMAD: Application-driven metadata aware deduplication archival storage system, in Proceedings of the Fifth IEEE International Workshop on Storage Network Architecture and Parallel I/Os, 2008, pp. 29–35.
[15] S. Quinlan and S. Dorward, Venti: A new approach to archival data storage, in Proceedings of the 1st USENIX Conference on File and Storage Technologies, 2002.
[16] B. Zhu, K. Li, and H. Patterson, Avoiding the disk bottleneck in the Data Domain deduplication file system, in Proceedings of the 6th USENIX Conference on File and Storage Technologies, 2008, pp. 1–14.
[17] J. A. Paulo and J. Pereira, A survey and classification of storage deduplication systems, ACM Computing Surveys, vol. 47, no. 1, pp. 11:1–11:30, 2014.
[18] F. Liu, Y. Sun, B. Li, B. Li, and X. Zhang, FS2You: Peer-assisted semipersistent online hosting at a large scale, IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 10, pp. 1442–1457, 2010.
[19] F. Liu, B. Li, B. Li, and H. Jin, Peer-assisted on-demand streaming: Characterizing demands and optimizing supplies, IEEE Transactions on Computers, vol. 62, no. 2, pp. 351–361, 2013.
[20] F. Xu, F. Liu, L. Liu, H. Jin, B. Li, and B. Li, iAware: Making live migration of virtual machines interference-aware in the cloud, IEEE Transactions on Computers, vol. 63, no. 12, pp. 3012–3025, 2014.
[21] F. Xu, F. Liu, H. Jin, and A. Vasilakos, Managing performance overhead of virtual machines in cloud computing: A survey, state of the art, and future directions, Proceedings of the IEEE, vol. 102, no. 1, pp. 11–31, 2014.
[22] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, Sparse indexing: Large scale, inline deduplication using sampling and locality, in Proceedings of the 7th Conference on File and Storage Technologies, 2009, pp. 111–123.
[23] A. Muthitacharoen, B. Chen, and D. Mazières, A low-bandwidth network file system, in Proceedings of the 18th ACM Symposium on Operating Systems Principles, 2001, pp. 174–187.
[24] D. R. Bobbarjung, S. Jagannathan, and C. Dubnicki, Improving duplicate elimination in storage systems, ACM Transactions on Storage, vol. 2, no. 4, pp. 424–448, 2006.
[25] EMC Data Domain, Data Domain appliance series: High-speed inline deduplication storage, http://www.emc.com/backup-and-recovery/data-domain/data-domain-deduplication-storage-systems.htm, 2011.
[26] W. Leesakul, P. Townend, and J. Xu, Dynamic data deduplication in cloud storage, in Proceedings of the 8th IEEE International Symposium on Service Oriented System Engineering, 2014, pp. 320–325.
[27] Y. Fu, H. Jiang, N. Xiao, L. Tian, F. Liu, and L. Xu, Application-aware local-global source deduplication for cloud backup services of personal storage, IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 5, pp. 1155–1165, 2014.
[28] P. Puzio, R. Molva, M. Onen, and S. Loureiro, ClouDedup: Secure deduplication with encrypted data for cloud storage, in Proceedings of the 5th IEEE International Conference on Cloud Computing Technology and Science, 2013.
[29] D. T. Meyer and W. J. Bolosky, A study of practical deduplication, in Proceedings of the 9th USENIX Conference on File and Storage Technologies, 2011.
[30] K. Eshghi and H. K. Tang, A framework for analysing and improving content-based chunking algorithms, Tech. Rep. HPL-2005-30(R.1), 2005.
[31] National Institute of Standards and Technology, FIPS PUB 180-1: Secure Hash Standard, Technical Report, 1995.
[32] R.
Rivest, The MD5 message-digest algorithm, http://www.ietf.org/rfc/rfc1321.txt, 1992.
[33] C.-H. Ng, M. Ma, T.-Y. Wong, P. P. C. Lee, and J. C. S. Lui, Live deduplication storage of virtual machine images in an open-source cloud, in Proceedings of the 12th International Middleware Conference, 2011, pp. 80–99.
[34] C. Ungureanu, B. Atkin, A. Aranya, S. Gokhale, S. Rago, G. Calkowski, C. Dubnicki, and A. Bohra, HydraFS: A high-throughput file system for the HYDRAstor content-addressable storage system, in Proceedings of the 8th USENIX Conference on File and Storage Technologies, 2010.
[35] B. H. Bloom, Space/time trade-offs in hash coding with allowable errors, Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
[36] G. Lu, Y. J. Nam, and D.-C. Du, BloomStore: Bloom-filter based memory-efficient key-value store for indexing of data deduplication on flash, in Proceedings of the 28th IEEE Symposium on Mass Storage Systems and Technologies, 2012.
[37] iSCSI Enterprise Target, http://sourceforge.net/projects/iscsitarget/files/, 2010.
[38] K. Jin and E. L. Miller, The effectiveness of deduplication on virtual machine disk images, in Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, 2009.

Jibin Wang received the bachelor degree in computer science from Chengdu University of Information Technology, China, in 2007, and the MS and PhD degrees in computer science from Huazhong University of Science and Technology, China, in 2010 and 2013, respectively. He is currently working in the Shandong Computer Science Center (National Supercomputer Center in Jinan). His research interests include computer architecture, networked storage systems, parallel and distributed systems, and cloud computing.

Zhigang Zhao received the BS and MS degrees in management science and engineering from Shandong Normal University, China, in 2002 and 2005, respectively.
Presently, he is a research assistant in the Shandong Computer Science Center (National Supercomputer Center in Jinan). His research interests include cloud computing, service-oriented computing, and big data.

Zhaogang Xu received the bachelor degree in network engineering from Shandong University of Science and Technology, China, in 2010, and the MS degree in computer architecture from Shandong University of Science and Technology, China, in 2013. He now works in the Shandong Computer Science Center (National Supercomputer Center in Jinan), Shandong Provincial Key Laboratory of Computer Networks. His research interests include cloud computing and big data storage.

Hu Zhang received the bachelor degree in electronic information science and technology from Liaocheng University of Information Technology, China, in 2009, and the MS degree in signal and information processing from Southwest Jiaotong University, China, in 2013. Presently, he is an engineer in the Shandong Computer Science Center (National Supercomputer Center in Jinan), Shandong Provincial Key Laboratory of Computer Networks. His research interests include cloud computing and big data mining.
Liang Li received the bachelor degree in computer science from Shandong University of Science and Technology, China, in 2003, and the MS degree in computer science from Shandong University, China, in 2006. Presently, she is a research assistant in the Shandong Computer Science Center (National Supercomputer Center in Jinan), Shandong Provincial Key Laboratory of Computer Networks. Her research interests include service-oriented computing and big data mining.

Ying Guo received the bachelor degree in information management from Shandong University of Finance and Economics, China, in 2000, and the MS degree in enterprise management from Dongbei University of Finance and Economics, China, in 2004. She is an associate research professor currently working for the Shandong Computer Science Center. Her research interests include cloud computing and software engineering.