Smarter Primary Storage Through Real-time Compression
It is no great revelation that primary storage continues to grow at an alarming rate. What may not be immediately obvious is the ripple effect of that growth: the data supporting every process throughout the storage lifecycle is also increasing in size. This is most typically seen in increased backup storage consumption. Whatʼs often missed is the impact of primary storage growth on data management practices like data protection.
The traditional ʻsolutionʼ to this capacity problem has been to buy and implement even more capacity. Storage is relatively inexpensive, and adding more can seem like a quicker and easier way out of a capacity problem than the alternatives. Storage systems have also advanced to the point that physically connecting more storage is less disruptive than it once was. But even though storage capacity continues to be inexpensive, there is a ripple effect to its addition that impacts overall efficiency, and this ripple effect is now haunting data centers. IT organizations can no longer keep addressing the problem by adding more and more capacity.
Even if the additional storage can be cost-justified there is also the threat of running out of data center
floor space or cooling capability, and the very real cost of managing that additional capacity. Interestingly,
when only these ʻsupportʼ costs are considered, the rate at which storage is now growing is still outpacing
its reduction in cost per GB.
In short, IT has to do more with the capacity it already has and maybe even shrink that capacity down to a more manageable size. The costs of ʻcare and feedingʼ for each additional GB of storage are simply too great.
IBM Real-time Compression, which addresses storage expansion at its root cause, primary storage, may
be the ideal solution. It not only reduces immediate storage growth, but has the long term impact of
increasing the efficiency of storage administrators and reducing costs.
Primary Storage Growth's Ripple Effect
When primary storage grows, the concern is not just the capacity added to the storage system upfront; itʼs the impact of those additions. Physically adding more shelves to a storage system increases its footprint.
Data center floor space is becoming one of the most expensive resources in IT, along with power and
cooling. Each additional shelf requires more power and cooling and reduces air flow, driving the need for
even more cooling which in turn drives up power usage further.
The next area of concern is the impact on the time efficiency of the IT staff that manages storage and the time lost by users while they wait for that storage to come online. Even if storage can be successfully added to a live storage system, human decisions need to be made, and thatʼs where the dynamic nature of storage expansion comes to a grinding halt. First, it must be decided how the new capacity will be provisioned. Then there is the administrative overhead of either provisioning new volumes or extending existing ones. If new volumes are created, it must be determined how large each volume will be and which server it will be assigned to, and then the attaching servers must be modified so that they can mount the new volumes. If volumes are to be extended, there may be downtime associated with that process as well.
This provisioning process takes time if itʼs to be done accurately, and if accuracy is sacrificed for speed, even more inefficiency creeps into the environment. Most data centers report that it typically takes a week to a month to provision new storage once it has been received on the loading dock. This is time that the users of these applications simply may not have, and it results in either the inaccuracy described above or many late nights for the IT staff, as well as dissatisfaction on the part of users.
7/19/2011 Page 2 of 7
Storage Switzerland, LLC
The Snapshot Ripple Effect
There is also the impact of clones, which leverage snapshots. Formerly a tool used exclusively for data protection, snapshots now see production use as clones: writable snapshot copies that reduce the capacity required when deploying new volumes. While snapshots are space efficient by design, the more of them that are in place, and the further they diverge from the original, the more growth occurs.
In a VMware example, a snapshot may be used to move a virtual machine to a prior state, while a clone may be used to build additional VMs using the base image as a master, or ʻgoldenʼ, image. Both are space efficient but do incur growth as the snapshot ages or the clone is personalized and diverges from the master. Multiply these small additions across dozens or even hundreds of VMs and there is real potential for a loss in storage efficiency. Also, because these are net new changes, other data efficiency techniques, like deduplication, wonʼt be effective against this growth.
Increasing Primary Storage Efficiency with IBM Real-time Compression
As stated above, the problem is not just that more primary storage capacity has to be bought and paid for; adding this primary storage also adds costs in time, space and other resources. The answer may be to increase primary storage optimization through the use of storage efficiency technologies like IBM Real-time Compression. Potentially the best place to improve this efficiency is upfront, at the source, as data is written to and read from primary storage. This is what real-time compression does.
IBM Real-time Compression optimizes data before itʼs ever stored on the hard disk, providing up to a 5x space reduction. To do this, the device performing the optimization must be placed inline between the servers and the storage. As data goes through the IBM Real-time Compression Appliance, itʼs compressed and then sent to the network attached storage (NAS) device.
Compressing a data stream increases the ʻeffective throughputʼ, as the same amount of information is contained in less data, and less physical space. This means that all the components of the storage system have to handle less data and, as a result, instantly perform better and more efficiently. The
bandwidth between the device and the storage and between the storage system and the shelves
increases. The effective capacity of the cache in the storage increases, and even the efficiency of the
drives improves, since more data can be collected on each rotation of the drive platters. The result is that
even though the optimization device is compressing all data inline, it does so without performance impact,
in most cases actually delivering a performance improvement.
Finally, there is also the obvious gain in storage capacity utilization. As stated above, the demand for increased capacity, especially when you factor in the cost of power and cooling, may be outpacing the expected cost reductions of that capacity. This means that, even on a hard cost basis, itʼs no longer “cheaper” to buy more storage than to invest in efficiency.
There is also an efficiency effect in other primary storage functions like the overhead associated with
RAID parity and the extra space required by clones. With IBM Real-time Compression the capacity
required to store cloned volumes is reduced by up to 80%. This means that clones can typically be
maintained for a longer period of time since they will require less disk capacity. It also means that updates
to the clones, driven by changes, will occur faster since less actual data has to be modified.
Inline compression solutions, like IBMʼs Real-time Compression Appliances, are potentially the fastest and simplest way to increase storage efficiency. For example, at the 5x compression ratio, the 100TB of data on a full 100TB storage system could be reduced to 20TB on disk.
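The arithmetic behind that example can be sketched in a few lines. The 5x ratio and 100TB figure come from the text above; the helper function itself is purely illustrative:

```python
def effective_capacity(logical_tb: float, compression_ratio: float) -> dict:
    """Estimate the physical footprint of data stored through inline compression.

    logical_tb: amount of uncompressed data written by the hosts
    compression_ratio: e.g. 5.0 for the 5x reduction cited above
    """
    stored_tb = logical_tb / compression_ratio   # physical space actually consumed
    freed_tb = logical_tb - stored_tb            # 'instant capacity' reclaimed
    reduction_pct = 100 * (1 - 1 / compression_ratio)
    return {"stored_tb": stored_tb, "freed_tb": freed_tb, "reduction_pct": reduction_pct}

# 100TB of data at a 5x ratio occupies 20TB, freeing 80TB (an 80% reduction)
print(effective_capacity(100, 5.0))
```

Note that a 5x ratio and an 80% reduction are the same statement: storing one fifth of the data frees four fifths of the space.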
Besides creating another 80% of ʻinstant capacityʼ, using real-time compression also reduces by 80% the
amount of data handled through every process in the data stream. This results in lower power, cooling
and floor space consumption, as well as less time and energy spent on detailed implementation and
provisioning plans by storage administrators. By storing more information in the same data space less
provisioning work has to occur, making IT staff more efficient.
Imagine a 100TB system that was 100% full now being only 20% full. That means 80TB of additional growth before a new storage system needs to be implemented, which in turn leads to efficiency gains in all the other areas mentioned: no additional floor space, power or cooling needs to be consumed, nor do storage administrators need to spend time working up detailed implementation and provisioning plans.
Smarter Data Protection
Data growth is occurring in the data protection process as well as in primary storage. While applying IBM Real-time Compression at the primary storage level helps with capacity growth, real-time compression brings its own unique value in storing protected data and, similar to primary storage, has its own ʻripple effectʼ in other areas of the infrastructure. The growth in data protection storage is caused not only by the growth of primary storage, but also by growth in the number of redundant instances of data now found in the data protection process.
For example, a snapshot of data is often taken on primary storage and then replicated to a secondary
location for disaster preparedness. The primary data is then backed up by traditional backup applications
daily and weekly to a disk storage area. Then, the backup jobs themselves are often replicated by a disk
backup appliance to a DR location which is finally copied to tape drives. While each of these processes
may have its own optimization capabilities, data has to be ʻre-inflatedʼ or ʻde-optimizedʼ before it can be
moved between process and storage types. IBM Real-time Compression can improve the efficiencies of
the individual optimization steps that may exist for each of the processes and make the transport between
them more effective.
The Data Protection Ripple Effect
Backup Software Ripple Effect
Today in the enterprise most backups are network-based, meaning that all the data has to be moved across the network to the backup server. While slower, this is significantly more cost effective than direct-attached or fibre-attached storage. Compared to WAN replication, the available local area network bandwidth may seem huge, but itʼs not when you consider that most backup applications donʼt have the ability to back up only changed blocks. While some have the intelligence to perform incremental backups (changed files only) and then merge them, most have to back up the entire data set.
Snapshot Growth
When a snapshot is taken, the storage system or operating system typically sets the current blocks of storage to read-only. Then, as users make changes to data that would affect these blocks, a new block is stored to represent the active, up-to-date data. Snapshots leverage this read-only collection of blocks to represent how the data looked at that point in time. When the snapshot has expired, the read-only blocks that changed are released and returned to the storage pool, available to be written over. As a result, snapshots are space-efficient, the only growth occurring when a block is added or modified. But the longer snapshots are held, and the more of them that are taken, the more space that has
to be reserved for their use. With most systems as the reserve areas begin to run out of space, snapshot
completion times grow longer and will cease altogether if there is no available reserve space.
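The mechanism described above can be made concrete with a minimal sketch. The `Volume` class below is a toy model of copy-on-write snapshot accounting, not any vendorʼs actual implementation:

```python
class Volume:
    """Toy block volume illustrating copy-on-write snapshot growth."""

    def __init__(self, nblocks: int):
        self.blocks = {i: b"\x00" for i in range(nblocks)}  # live (active) data
        self.snapshots = []  # each snapshot: {block_index: preserved_block}

    def take_snapshot(self) -> int:
        # A snapshot reserves nothing upfront: existing blocks are logically
        # frozen, and space is consumed only when data later changes.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index: int, data: bytes) -> None:
        # Before overwriting, preserve the old block for every snapshot
        # that has not yet recorded its own copy of that block.
        for snap in self.snapshots:
            if index not in snap:
                snap[index] = self.blocks[index]
        self.blocks[index] = data

    def snapshot_space(self, snap_id: int) -> int:
        # Blocks a snapshot holds onto: grows only with change, so the
        # longer a snapshot is held, the more space it tends to consume.
        return len(self.snapshots[snap_id])

vol = Volume(8)
s0 = vol.take_snapshot()
vol.write(0, b"\x01")
vol.write(1, b"\x01")
print(vol.snapshot_space(s0))  # 2 blocks preserved after two changed blocks
```

Running the example shows the behavior the text describes: the snapshot costs nothing at creation and grows one preserved block per changed block.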
Replication Growth
For most systems snapshots are also the core technology for an off-site replication feature. They leverage
the same changed block tracking technique to know which blocks should be sent across the wide area
network (WAN) to update the storage system at the DR site. Again, while snapshots may be space-efficient in an active system, multiple snapshots combined with a low WAN transfer speed can become a significant bottleneck, because even the small growth of snapshots is more than many WAN segments can sustain. This causes the two systems to be out of sync for an increasing amount of time; in some cases, they may never catch up.
There is also the issue with the DR site, where storage often must be a mirror image, in size and capacity,
of the local storage system. This means that when capacity is added to the primary system it must also be
added to the DR system.
The Weaknesses of Deduplication Only
In many environments local backups now use a combination of disk and tape to store data. While disk does improve performance, the most popular disk-based solutions are attached via the IP network, as are most of the servers being backed up. These disk-based backup solutions have almost all added deduplication to improve backup efficiency, but most do so only after the data has been sent across the LAN and received by the backup device.
A second weak point in deduplication is that it only works if there is redundant data available; net new
data typically wonʼt deduplicate well. As a result disk backup deduplication systems are inefficient when
transferring data between their own devices and could stand to be more efficient when storing data to
their devices as well.
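Why net new data defeats deduplication can be seen in a minimal fixed-block hashing sketch. This is a generic illustration of the technique, not any particular vendorʼs algorithm:

```python
import hashlib

def dedup_stored_size(data: bytes, block_size: int = 4096) -> int:
    """Return the bytes actually stored after fixed-block deduplication."""
    seen = set()
    stored = 0
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).digest()
        if digest not in seen:      # only the first copy of each block is kept
            seen.add(digest)
            stored += len(block)
    return stored

redundant = b"\x00" * 4096 * 100                                     # 100 identical blocks
unique = b"".join(i.to_bytes(4, "big") * 1024 for i in range(100))   # 100 distinct blocks

print(dedup_stored_size(redundant))  # 4096   -> collapses to a single block
print(dedup_stored_size(unique))     # 409600 -> net new data, zero savings
```

The redundant stream shrinks by 100x, while the stream of unique blocks is stored in full. Compression, by contrast, works within each block and needs no redundancy across them.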
The Remote Vaulting Ripple Effect
Remote backup is becoming an increasingly popular method of electronically moving data off-site, instead of copying data to tape and shipping the cartridges to a vault. Its requirements are similar to those of the replication described above: the ability to optimize the WAN connection between the two locations is important, as is the ability to reduce the storage footprint of data at the remote location. Again, deduplication only works on redundant data and in most cases requires identical devices.
The Restoration Ripple Effect
The final area to consider is the impact on restoration from all these different devices. Snapshots have probably the least restoration impact, since they can be directly mounted and in some cases directly used by the application. Their challenge is the length of time theyʼre retained and the fact that theyʼre typically only suitable for recovering the most recent copy of data. As discussed above, if an earlier point in time is required, another backup storage source must be used. The problem with these remaining storage sources is the speed at which they can deliver data. Theyʼre not only constrained by the network they happen to be on but are also inhibited by the storage optimization schemes they use. Deduplicated data must be reassembled on the fly as itʼs restored to the recovery location, something which adversely impacts performance in almost every case.
Increasing Data Protection Efficiency with IBM Real-time Compression
IBM Real-time Compression is an ideal solution for most of these data protection challenges, and by implementing it alongside existing solutions, it can increase efficiencies across every aspect of the data protection process, not just in storage capacity. The key uniqueness of IBM Real-time Compression is that, unlike other capacity optimization solutions, it can keep data in an optimized state throughout the data protection workflow. In a data protection deployment, the IBM Real-time Compression Appliance sits in front of the primary storage. Any data that is "behind" the appliance stays in a compressed state and gains the efficiencies of IBM Real-time Compression, typically up to 5x compression.
Impact of more efficient local backups
IBM Real-time Compression adds the ability to improve backup software network performance by keeping
the data in a compressed state as it moves from primary storage to backup storage. In addition to the
performance gains on the network and the capacity gains in backup storage, the backup server itself also
sees a performance improvement. It has less data to handle, that data is already in a compressed format
and it doesnʼt have to wait as long to confirm that the data has been written to backup storage. The result
should be that real-time data compression not only makes the backup process faster but is able to extend
the useful life of the backup server itself.
Impact of more efficient snapshots
Another area of positive impact is with snapshots and clones. Snapshots are references to previous views
of data at certain points in time and are typically used for data protection and recovery. A snapshot may
be used to move a volume or virtual machine to a prior state. Like clones they are space-efficient, but do
incur growth as the snapshot ages. With IBM Real-time Compression the capacity required to store
snapshotted or cloned volumes is reduced by up to 80%. This means that snapshots can typically be
maintained for a longer period of time since they will require less disk capacity.
Impact of more efficient WAN replication
Disaster Recovery (DR) capability is an important item on every IT agenda, and the foundation of a DR plan is making sure that data is available off-site. This is typically done by a WAN replication process available within the storage system. Despite the intelligence of block replication, the speed limitations of the typical WAN can impact how "in-sync" the remote location is. In busy environments, with lots of changed data and a slow WAN connection, the DR site could be many minutes, or even hours, out of sync with the primary location. Leveraging IBM Real-time Compression once again reduces the typical data size by up to 5x, effectively providing 5 times more bandwidth and helping a DR site thatʼs 30 minutes out of sync become 100% in sync.
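The bandwidth effect can be estimated with simple arithmetic. The change rate and link speed below are illustrative figures; only the 5x ratio comes from the text:

```python
def replication_lag_minutes(change_rate_gb_per_hr: float,
                            wan_mbps: float,
                            compression_ratio: float = 1.0) -> float:
    """Estimate how far behind a DR site falls per hour of operation.

    Returns the minutes the WAN would need to drain one hour's backlog;
    0.0 means the link keeps up with the change rate.
    """
    # Convert the hourly change rate into the megabits/second it demands
    required_mbps = change_rate_gb_per_hr * 8 * 1000 / 3600
    # Compressing the stream 5x is equivalent to 5x the effective bandwidth
    effective_mbps = wan_mbps * compression_ratio
    if effective_mbps >= required_mbps:
        return 0.0
    # Data generated in an hour minus data the link shipped in that hour
    backlog_gb = change_rate_gb_per_hr - effective_mbps * 3600 / (8 * 1000)
    return backlog_gb * 8 * 1000 / effective_mbps / 60

# 90GB/hr of changes over a 100Mbps link: uncompressed, the link falls behind
print(replication_lag_minutes(90, 100))       # 60.0 extra minutes of lag per hour
# ...but keeps fully in sync with 5x compression
print(replication_lag_minutes(90, 100, 5.0))  # 0.0
```

Once the effective bandwidth exceeds the change rate, the lag stops accumulating and the sites can converge, which is the "out of sync becomes in sync" effect described above.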
Impact of more efficient remote backups
Remote vaulting of backup data is becoming increasingly popular compared with the older ʻtape and truckʼ method. It has been enabled in large part by data deduplication devices, which replicate only the net new blocks of data they receive, similar to snapshots. They create an important second tier in DR strategies, after the replication of primary volumes described above. The replication process that these deduplication systems use is limited by the available bandwidth on the WAN and, once
again, is helped by real-time compression. In fact, even in cases where the deduplication appliance can
perform compression, products like IBM Real-time Compression make that process more efficient.
Real-time compression should not be viewed as a competitor to traditional deduplication, but as a
complement. It improves transfer performance across the network to the deduplication device, improves
the storage efficiency of the device, improves the deduplication analysis performance of the device and
improves the WAN replication capabilities.
Impact of more efficient restores
The real benefit of all the efficiencies that IBM Real-time Compression brings to the backup process is in the restore process. Compared with traditional backup applications, data can be located on the backup device, pulled from it and sent through the backup server to its final destination significantly faster. Itʼs not atypical for restore jobs to take hours, and dividing that restore time by 5 can make the difference between an application being back online in one hour instead of five.
Product Analysis - What Is The IBM Real-time Compression Appliance?
IBM Real-time Compression is a technology hosted on an appliance, available in two models sized for the workload a data center needs to optimize. The STN6500 is ready to deploy into 1 Gb Ethernet NAS environments and has 16 ports, supporting 8 connections between NAS systems and network switches. The STN6800 can be customized with multiple 10 Gb and 1 Gb Ethernet ports to support high throughput requirements. It can have up to four 10GbE connections for high throughput environments or can mix two 10GbE and four 1GbE connections for greater flexibility.
Implementation of IBM Real-time Compression is straightforward. Effectively, the appliances sit between the network switches and the storage devices or NAS heads. As the name implies, compression happens inline, prior to the data being stored on the NAS disk. There are no changes required to the storage systems or to the shared volumes that the NAS is hosting. Once activated, as data is written to the NAS systems, the path is through the IBM Real-time Compression Appliances.
With inline compression in place, the data center will come to count on its space savings and efficiency gains, so it will become important for some environments that the IBM Real-time Compression Appliances support high availability. For those situations, a second unit can be added, with automated failover between the two units.
While it may seem that adding the extra step of compression should degrade performance, the opposite is actually the case. The cost of compressing the data in these dedicated appliances is less than the gain seen by the storage infrastructure, which now has less data to deal with.
Summary
The unprecedented data growth seen by data centers is causing more than just storage budgeting problems for IT Managers; it is costing the efficiency of both employees and processes. With IBMʼs Real-time Compression Appliances, data centers can regain control over their data problems. With its implementation they will see a ripple effect as the benefits of increased efficiencies within the production and data protection processes spread from primary storage throughout the infrastructure. The net result will be an improvement in the productivity of personnel and in the ability of those processes to keep up with user demands.