A White Paper




Demystifying Data De-Duplication:
Choosing the Best Solution
Selecting a data de-duplication solution that enables cost-effective,
high-performance, and scalable long-term data storage




Abstract: Data de-duplication has been described as the next evolutionary step in backup technology. Many vendors lay claim to the best data de-duplication approach, leaving customers to face the difficulty of separating hype from reality. A number of key factors must be considered in order to select a data de-duplication solution that actually delivers cost-effective, high-performance, and scalable long-term data storage. This document provides the background information required to make an informed data de-duplication purchasing decision.



Introduction

In an August 2006 report from the Clipper Group entitled ‘The Evolution of Backups – Part Two – Improving Capacity’, author Dianne McAdam writes: “De-duplication is the next evolutionary step in backup technology.” Eliminating duplicate data in secondary storage archives can slash media costs, streamline management tasks, and minimize the bandwidth required to replicate data.

Despite the appeal of this data de-duplication concept, widespread adoption has been slowed by the high costs of processing power required to identify duplicate data, index unique data, and restore compacted data to its original state. The path has only recently been cleared for a proliferation of data de-duplication solutions as processing power has become more cost-effective.

Many vendors lay claim to the best data de-duplication approach, leaving customers to face the difficulty of separating hype from reality and determining which factors are really important to their business. With some vendors setting unrealistic expectations by predicting huge reductions in data volume, early adopters may find themselves ultimately disappointed with their solution.

Companies must consider a number of key factors in order to select a data de-duplication solution that actually delivers cost-effective, high-performance, and scalable long-term data storage. This document provides the background information required to make an informed data de-duplication purchasing decision.

Data De-Duplication is Becoming an Operational Requirement

Because secondary storage volumes are growing exponentially, companies need a way to dramatically reduce data volumes. Regulatory changes such as the Federal Rules of Civil Procedure magnify the challenge, forcing businesses to change the way they look at data protection. By eliminating duplicate data and ensuring that data archives are as compact as possible, companies can keep more data online longer, at significantly lower costs. As a result, data de-duplication is fast becoming a required technology for any company wanting to optimize the cost-effectiveness and performance of its data storage environment.

Although compression technology can deliver an average 2:1 data volume reduction, this is only a fraction of what is required to deal with the data deluge most companies now face. Only data de-duplication technology can meet the requirements companies have for far greater reductions in data volumes.

Data de-duplication can also minimize the bandwidth needed to transfer backup data to offsite archives. With the hazards of physically transporting tapes well-established (damage, theft, loss, etc.), electronic transfer is fast becoming the offsite storage modality of choice for companies concerned about minimizing risks and protecting essential resources.

Key Criteria for a Robust Data De-Duplication Solution

There are eight key criteria to consider when evaluating data de-duplication solutions:

1. Focus on the largest problem
2. Integration with current environment
3. Virtual tape library (VTL) capability
4. Impact of de-duplication on backup performance
5. Scalability
6. Distributed topology support
7. Real-time repository protection
8. Efficiency and effectiveness

1. Focus on the largest problem

The first consideration is whether the solution attacks the area where the largest problem exists: backup data in secondary storage. Duplication in backup data can cause its storage requirement to be many times what would be required if the duplicate data could be eliminated.

The following graphic, courtesy of the Enterprise Strategy Group (ESG), illustrates why a new technology evolution in backup is necessary. Incremental and differential backups were introduced to decrease the amount of data required compared to a full backup, as depicted in Figure 1.

However, even within incremental backups, there is significant duplication of data when protection is based on file-level changes. When considered across multiple servers at multiple sites, the opportunity for storage reduction by implementing a data de-duplication solution becomes huge.

2. Integration with current environment

An effective data de-duplication solution should be as non-disruptive as possible. Many companies are turning to virtual tape libraries to improve the quality of their backup without disruptive changes to policies, procedures, or software. This makes VTL-based data de-duplication the least disruptive way to implement this technology. It also focuses on the largest pool of duplicated data: backups.

Solutions requiring proprietary appliances tend to be less cost-effective than those providing more openness and deployment flexibility. An ideal solution is one that is available as both software and turnkey appliances in order to provide the maximum opportunity to utilize existing resources.

3. VTL capability

If data de-duplication technology is implemented around a VTL, the capabilities of the VTL itself must be considered as part of the evaluation process. It is unlikely that the savings from data de-duplication will override the difficulties caused by using a sub-standard VTL. Consider the functionality, performance, stability, and support of the VTL as well as its de-duplication extension.




[Figure 1: How various backup methods measure up. Full backups include all data with every backup; differential backups include all data modified since the last full backup; incremental backups include only data since the previous backup; de-duplicated backups include files modified and unique data created since the previous backup.]
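To put the comparison in Figure 1 into rough numbers, the short calculation below works through one hypothetical scenario: a 1TB data set protected nightly for 30 days, where about 5% of files are touched each day but only about 2% of the data is genuinely new at the block level. These figures are illustrative assumptions made for this sketch, not ESG measurements or vendor claims.

```python
# Rough, illustrative comparison of secondary-storage consumption over one
# month of nightly backups. All parameters are assumptions for this sketch,
# not measured values or vendor claims.

FULL_TB = 1.0             # size of one full backup (TB)
FILE_CHANGE_RATE = 0.05   # fraction of files touched per day (file-level view)
BLOCK_CHANGE_RATE = 0.02  # fraction of data that is truly new at the block level
DAYS = 30                 # retention window

# Full backup every night: every copy stores everything.
full_only = DAYS * FULL_TB

# One full plus nightly incrementals: each incremental stores whole changed files.
incremental = FULL_TB + (DAYS - 1) * FILE_CHANGE_RATE * FULL_TB

# One full plus nightly de-duplicated backups: only unique blocks are kept.
deduplicated = FULL_TB + (DAYS - 1) * BLOCK_CHANGE_RATE * FULL_TB

for label, tb in [("Nightly fulls", full_only),
                  ("Full + incrementals", incremental),
                  ("Full + de-duplicated", deduplicated)]:
    print(f"{label:22s} {tb:6.2f} TB  ({full_only / tb:4.1f}:1 vs. nightly fulls)")
```

Under these assumptions, nightly fulls consume 30TB, a full plus incrementals about 2.5TB, and a full plus de-duplicated backups about 1.6TB: roughly a 19:1 reduction versus nightly fulls, far beyond the typical 2:1 of compression alone. Actual ratios depend entirely on the data and the protection policies, as the summary later in this paper stresses.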




4. Impact of de-duplication on backup performance

It is important to consider where and when data de-duplication takes place in relation to the backup process. Although some solutions attempt de-duplication while data is being backed up, this inline method processes the backup stream as it comes into the de-duplication appliance, making performance dependent on the strength of a single node. Such an approach can slow down backups, jeopardize backup windows, and degrade VTL performance by as much as 60% over time.

By comparison, data de-duplication solutions that run after backup jobs complete or concurrently with backup avoid this problem and have no adverse impact on backup performance. This post-process method reads the backup data from VTL storage after backups have been cached to disk, which ensures that backups are not throttled by VTL or storage limitations. An enterprise-class solution that offers this level of flexibility is ideal for organizations looking for a choice of de-duplication methods.

For maximum manageability, the solution should allow for granular (tape- or group-level) policy-based de-duplication based on a variety of factors: resource utilization, production schedules, time since creation, and so on. In this way, storage economies can be achieved while optimizing the use of system resources.

5. Scalability

Because the solution is being chosen for longer-term data storage, scalability, in terms of both capacity and performance, is an important consideration. Consider growth expectations over five years or more. How much data will you want to keep on disk for fast access? How will the data index system scale to your requirements?

A de-duplication solution should provide an architecture that allows economic “right-sizing” for both the initial implementation and the long-term growth of the system. For example, a clustering approach allows organizations to scale to meet growing capacity requirements – even for environments with many petabytes of data – without compromising de-duplication efficiency or system performance. Clustering enables the VTL to be managed and used logically as a single data repository, supporting even the largest of tape libraries. Clustering also inherently provides a high-availability environment, protecting the VTL and de-duplication nodes by offering failover support (Figure 2).

6. Distributed topology support

Data de-duplication is a technology that can deliver benefits throughout a distributed enterprise, not just in a single data center. A solution that includes replication and multiple levels of de-duplication can achieve maximum benefits from the technology.

For example, a company with a corporate headquarters, three regional offices, and a secure disaster recovery (DR) facility should be able to implement de-duplication in the regional offices to facilitate efficient local storage and replication to the central site. The solution should only require minimal bandwidth for the central site to determine whether the remote data is already contained in the central repository. Only data that is unique across all sites should be replicated to the central site and subsequently to the DR site, to avoid excessive bandwidth requirements (a simplified sketch of this exchange follows criterion 7 below).

7. Real-time repository protection

Access to the de-duplicated data repository is critical and should not be vulnerable to a single point of failure. A robust data de-duplication solution will include mirroring to protect against local storage failure as well as replication to protect against disaster. The solution should have failover capability in the event of a node failure. Even if multiple nodes in a cluster fail, the company must be able to continue to recover its data and respond to the business.
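The minimal-bandwidth exchange described under criterion 6 can be pictured as a fingerprint lookup: a remote office sends only the hashes of its local chunks, the central site replies with the fingerprints it does not yet hold, and only those chunks cross the WAN. The sketch below illustrates that two-phase idea in generic Python; the class and function names and the in-memory “repository” are invented for this example and do not describe any particular vendor’s protocol.

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    """Content fingerprint used to ask 'do you already have this chunk?'"""
    return hashlib.sha256(chunk).hexdigest()

class CentralRepository:
    """Stands in for the central site's de-duplicated store and its index."""
    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk data

    def missing(self, fingerprints):
        """Phase 1: answer a cheap metadata query listing unknown fingerprints."""
        return [fp for fp in fingerprints if fp not in self.chunks]

    def store(self, chunks_by_fp):
        """Phase 2: accept only the chunks that were actually missing."""
        self.chunks.update(chunks_by_fp)

def replicate(local_chunks, central: CentralRepository):
    """Send fingerprints first, then transfer only globally unique chunks."""
    by_fp = {fingerprint(c): c for c in local_chunks}
    needed = central.missing(list(by_fp))            # small metadata exchange
    central.store({fp: by_fp[fp] for fp in needed})  # bulk data only for new chunks
    return len(needed), len(by_fp)

# Example: two offices back up overlapping data; the second transfer is tiny.
central = CentralRepository()
office_a = [b"payroll-db-block-1", b"shared-os-image", b"email-archive"]
office_b = [b"shared-os-image", b"email-archive", b"regional-sales-data"]
print(replicate(office_a, central))  # (3, 3) -> everything is new
print(replicate(office_b, central))  # (1, 3) -> only one unique chunk travels
```

In the topology described above, the same exchange could then repeat from the central site to the DR facility, so each unique chunk needs to travel over the wide-area network at most once.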




[Figure 2: An example of a clustered de-duplication architecture]




8. Efficiency and effectiveness

File-based de-duplication approaches do not reduce storage capacity requirements as much as those that analyze data at a sub-file or block level. Consider, for example, changing a single line in a 4MB presentation. In a file-based solution, the entire 4MB file must be stored again, doubling the storage required. If the presentation is sent to multiple people, as presentations often are, the negative effects multiply.

Most sub-file de-duplication processes use some sort of “chunking” method to break up a large amount of data, such as a virtual tape cartridge, into smaller-sized pieces to search for duplicate data. Larger chunks of data can be processed at a faster rate, but less duplication is detected. It is easier to detect more duplication in smaller chunks, but the overhead to scan the data is much higher.

If the “chunking” begins at the beginning of a tape (or data stream in other implementations), the de-duplication process can be fooled by the metadata created by the backup software, even if the file is unchanged. However, if the solution can segregate the metadata and look for duplication in chunks within the actual data files, duplication detection will be much higher. Some solutions even adjust chunk size based on information gleaned from the data formats. The combination of these techniques can lead to a 30-40% increase in the amount of duplicate data detected, which can have a major impact on the cost-effectiveness of the solution. (A simplified chunking sketch follows the summary below.)

Summary: Focus on the Total Solution

As stored data volumes continually increase due to the demands of business applications and regulatory requirements, data de-duplication is fast becoming a vital technology. Data de-duplication is the only way to dramatically reduce data volumes, slash storage requirements, and minimize data protection costs and risks.

Although the benefits of data de-duplication are dramatic, organizations should not be seduced by the hype sometimes attributed to the technology. No matter the approach, the amount of data de-duplication that can occur is driven by the nature of the data and the policies used to protect it.

In order to achieve the maximum benefit of de-duplication, organizations should choose data de-duplication solutions based on a comprehensive set of quantitative and qualitative factors rather than relying solely on statistics such as theoretical data reduction ratios.
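To make the chunk-size trade-off concrete, the sketch below (referenced under criterion 8) de-duplicates two “generations” of the same 4MB file, differing by a single small edit, using two fixed chunk sizes, and reports how much data must actually be stored and how large the fingerprint index grows. It is a toy, fixed-size chunker written purely for illustration; as noted above, production solutions typically use more sophisticated techniques such as metadata segregation and format-aware, variable chunk sizes.

```python
import hashlib
import random

def dedup_store(streams, chunk_size):
    """Fixed-size chunking with SHA-256 fingerprints: keep each unique chunk once."""
    index = {}                       # fingerprint -> chunk (the 'repository')
    total = stored = 0
    for data in streams:
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            total += len(chunk)
            if fp not in index:      # duplicate chunks consume no extra storage
                index[fp] = chunk
                stored += len(chunk)
    return total, stored, len(index)

# Two backup "generations" of the same ~4MB file; the second has one small edit.
generation_1 = random.Random(0).randbytes(4_000_000)
generation_2 = generation_1[:2_000_000] + b"edited line" + generation_1[2_000_011:]

for chunk_size in (256 * 1024, 8 * 1024):            # large vs. small chunks
    total, stored, entries = dedup_store([generation_1, generation_2], chunk_size)
    print(f"chunk size {chunk_size // 1024:3d}KB: stored {stored / 1e6:4.2f}MB "
          f"of {total / 1e6:4.2f}MB backed up, index entries: {entries}")
```

With 256KB chunks, the single edit forces a whole chunk to be stored again, while with 8KB chunks almost no extra data is stored; the price is a fingerprint index roughly thirty times larger, which is the higher scanning and lookup overhead described above.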




For more information, visit www.falconstor.com or contact your local FalconStor representative.

Corporate Headquarters (USA): +1 631 777 5188, sales@falconstor.com
European Headquarters (France): +33 1 39 23 95 50, infoeurope@falconstor.com
Asia-Pacific Headquarters (Taiwan): +866 4 2259 1868, infoasia@falconstor.com
Information in this document is provided “AS IS” without warranty of any kind, and is subject to change without notice by FalconStor, which assumes no responsibility for any
errors or claims herein. Copyright © 2008 FalconStor Software. All Rights Reserved. FalconStor Software and FalconStor are trademarks or registered trademarks of FalconStor
Software, Inc. in the United States and other countries. All other company and product names contained herein are or may be trademarks of the respective holder.
                                                                                                                                                                                                        
DDDWP080211.
