Open Source Data Deduplication




                        Nick Webb
nickw@redwireservices.com www.redwireservices.com @RedWireServices
                          (206) 829-8621
                          Last updated 8/10/2011
Introduction
●   What is Deduplication? Different kinds?
●   Why do you want it?
●   How does it work?
●   Advantages / Drawbacks
●   Commercial Implementations
●   Open Source implementations, performance,
    reliability, and stability of each
What is Data Deduplication
Wikipedia:

. . . data deduplication is a specialized data compression
technique for eliminating coarse-grained redundant data,
typically to improve storage utilization. In the deduplication
process, duplicate data is deleted, leaving only one copy of
the data to be stored, along with references to the unique
copy of data. Deduplication is able to reduce the required
storage capacity since only the unique data is stored.

Depending on the type of deduplication, redundant files may
be reduced, or even portions of files or other data that are
similar can also be removed . . .
Why Dedupe?
●   Save disk space and money (fewer disks)
●   Fewer disks = less power, cooling, and space
●   Improve write performance (of duplicate data)
●   Be efficient – don’t re-copy or store previously
    stored data
Where does it Work Well?
●   Secondary Storage
    ●   Backups/Archives
    ●   Online backups with limited
        bandwidth/replication
    ●   Save disk space – additional
        full backups take little space
●   Virtual Machines (Primary &
    Secondary)
●   File Shares
Not a Fit
●   Random data
    ●   Video
    ●   Pictures
    ●   Music
    ●   Encrypted files
         –   many vendors dedupe, then encrypt
Types
●   Source / Target
●   Global
●   Fixed/Sliding Block (toy sketch after this list)
●   File Based (SIS)
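To make fixed-block dedupe concrete, here is a toy illustration (not how any tool below actually stores data): split a file into fixed 8K chunks and hash them; each group of chunks sharing a hash would be stored only once.

split -b 8192 bigfile chunk.
sha256sum chunk.* | sort | uniq -c -w64 | sort -rn   # count duplicate chunks by hash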
Drawbacks
●   Slow writes, slower reads
●   High CPU/memory utilization (dedicated server
    is a must)
●   Increases data loss risk / corruption
    ●   Collision risk of 1.3x10^-49% chance per PB
    ●   (256-bit hash & 8KB blocks; see the estimate below)
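For context, the standard back-of-the-envelope behind numbers like this is the birthday bound; the exact assumptions behind the figure above aren't stated, so take this as the method rather than a re-derivation:

    p ≈ n^2 / 2^(b+1), where b = 256 hash bits
    n = 1PB / 8KB ≈ 1.2x10^11 stored blocks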
How Does it Work?
Without Dedupe
With Dedupe
Block Reclamation
 ●   In general, blocks are not
     removed/freed when a file is
     removed
 ●   We must periodically scan blocks for
     references; a block with no references
     can be deleted, freeing allocated
     space (toy sweep below)
 ●   Process can be expensive,
     scheduled during off-peak
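A toy sketch of that reference scan, purely illustrative (not how SDFS or LessFS implement it); it assumes chunks are stored as files named by their hash and refs.txt lists every hash still referenced:

# remove any chunk whose hash no longer appears in refs.txt
for f in chunks/*; do
    grep -qx "$(basename "$f")" refs.txt || rm "$f"
done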
Commercial Implementations
●   Just about every backup vendor
    ●   Symantec, CommVault
    ●   Cloud: Asigra, Barracuda, Dropbox (global), JungleDisk,
        Mozy
●   NAS/SAN/Backup Targets
    ●   NEC HydraStor
    ●   DataDomain/EMC Avamar
    ●   Quantum
    ●   NetApp
Open Source Implementations
●   FUSE based
    ●   Lessfs
    ●   SDFS (OpenDedupe)
●   Others
    ●   ZFS
    ●   btrfs (off-line only?)
●   Limited (file based / SIS)
    ●   BackupPC (reliable!)
    ●   Rdiff-backup
How Good is it?
●   Many see 10-20x deduplication, meaning 10-20
    times more logical object storage than physical
●   Especially true in backup or virtual
    environments
SDFS / OpenDedupe
                 www.opendedup.org
●   Java 7 Based / platform agnostic
●   Uses FUSE
●   S3 storage support
●   Snapshots
●   Inline or batch mode deduplication
●   Supposedly fast (290MBps+ on great H/W)
●   Support for global/clustered dedupe
●   Probably the most mature OSS dedupe (IMHO)
SDFS
SDFS Install & Go

Install Java
# rpm -Uvh SDFS-1.0.7-2.x86_64.rpm
# sudo mkfs.sdfs --volume-name=sdfs_128k \
     --io-max-file-write-buffers=32 \
     --volume-capacity=550GB \
     --io-chunk-size=128 \
     --chunk-store-data-location=/mnt/data
# sudo modprobe fuse
# sudo mount.sdfs -v sdfs_128k -m /mnt/dedupe
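A quick smoke test of the new mount (hypothetical paths; the second copy is duplicate data, so the chunk store should grow by roughly one copy, not two):

dd if=/dev/urandom of=/tmp/test.bin bs=1M count=100
cp /tmp/test.bin /mnt/dedupe/copy1.bin
cp /tmp/test.bin /mnt/dedupe/copy2.bin   # same data again
du -sh /mnt/data                         # chunk store: ~100MB, not ~200MB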
SDFS
●   Pro
    ●   Works when configured properly
    ●   Appears to be multithreaded
●   Con
    ●   Slow / resource intensive (CPU/Memory)
    ●   Fragile: easy to mis-set options, leading to crashes with
        little user feedback
    ●   Standard POSIX utilities do not show accurate data (e.g. df;
        must use getfattr -d <mount point>, and calculate bytes →
        GB/TB and % free yourself; example below)
    ●   Slow with 4k blocks (the block size recommended for VMs)
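E.g., pulling the stats mentioned above (attribute names vary by SDFS release, so check your version's docs; the byte count here is made up):

getfattr -d /mnt/dedupe              # dumps volume stats as extended attributes
echo $(( 137438953472 / 1024**3 ))   # convert a reported byte count to GB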
LessFS
                          www.lessfs.com

●   Written in C = Less CPU Overhead
●   Have to build yourself (configure && make && make install)
●   Has replication, encryption
●   Uses FUSE
LessFS Install
wget http://...lessfs-1.4.2.tar.gz
tar zxvf *.tar.gz
wget http://...db-4.8.30.tar.gz
yum install buildstuff…
. . .
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag
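The elided steps follow the configure && make && make install pattern noted above; roughly as below, assuming Berkeley DB 4.8 is built first since lessfs links against it (details vary by distro):

tar zxvf db-4.8.30.tar.gz
cd db-4.8.30/build_unix && ../dist/configure && make && sudo make install
cd ../../lessfs-1.4.2
./configure && make && sudo make install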
LessFS Go
sudo vi /etc/lessfs.cfg
BLOCKDATA_PATH=/mnt/data/dta/blockdata.dta
META_PATH=/mnt/meta/mta
BLKSIZE=4096 # only 4k supported on CentOS 5
ENCRYPT_DATA=on
ENCRYPT_META=off


mklessfs -c /etc/lessfs.cfg
lessfs /etc/lessfs.cfg /mnt/dedupe
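To sanity-check the mount and tear it down afterwards (fusermount -u is the standard way to unmount a FUSE filesystem):

cp /tmp/test.bin /mnt/dedupe/
ls -lh /mnt/dedupe
fusermount -u /mnt/dedupe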
LessFS
●   Pro
    ●   Does inline compression by default as well
    ●   Reasonable VM compression with 128k blocks
●   Con
    ●   Fragile
    ●   Stats/FS info hard to see (per-file accounting, no totals)
    ●   Kernel >= 2.6.26 required for blocks > 4k (RHEL6 only)
    ●   Running with 4k blocks is not really feasible
LessFS
Other OSS
●   ZFS?
    ●   Tried it; it felt slow, but I have no hard data
        (got roughly 3x dedupe with identical full
        backups of VMs)
    ●   At least it’s stable…
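For comparison, enabling ZFS dedupe is a one-liner, and zdb can estimate the ratio before you commit (pool/dataset names here are hypothetical):

zfs set dedup=on tank/backups   # inline dedupe for one dataset
zpool list                      # DEDUP column shows the achieved ratio
zdb -S tank                     # simulate the dedupe table without enabling it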
Kick the Tires
●   Test data set: ~330GB of data
    ●   22GB of documents, pictures, music
    ●   Virtual Machines
        –   220GB Windows 2003 Server with SQL Data
        –   2003 AD DC ~60GB
        –   2003 Server ~8GB
        –   Two OpenSolaris VMs, 1.5 & 2.7GB
        –   3GB Windows 2000 VM
        –   15GB XP Pro VM
Kick the Tires
●   Test Environment
    ●   AWS High CPU Extra Large Instance
    ●   ~7GB of RAM
    ●   ~Eight Cores ~2.5GHz each
    ●   ext4
Compression Performance
●   First round (all “unique” data)
●   If another copy was put in (like another full), we should expect
    100% reduction for that non-unique data (1x dedupe per run)
      FS                Home Data   % Home Reduction   VM Data   % VM Reduction   Combined   % Total Reduction   MBps
      SDFS 4k           21GB        4.50%              109GB     64%              128GB      61%                 16
      lessfs 4k (est.)  24GB        -9%                N/A       51%              N/A        50%                 4
      SDFS 128k         21GB        4.50%              255GB     16%              276GB      15%                 40
      lessfs 128k       21GB        4.50%              130GB     57%              183GB      44%                 24
      tar/gz --fast     21GB        4.50%              178GB     41%              199GB      39%                 35
Write Performance
                      (don't trust this)
[Bar chart: write throughput in MBps, 0-40 scale, for raw, SDFS 4k, lessfs 4k, SDFS 128k, lessfs 128k, and tar/gz --fast]
Kick the Tires: Part 2
●   Test data set – two ~204GB full backup
    archives from a popular commercial vendor
●   Test Environment
    ●   VirtualBox VM, 2GB RAM, 2 Cores, 2x7200RPM
        SATA drives (meta & data separated for LessFS)
    ●   Physical CPU: Quad Core Xeon
Write Performance
[Bar chart: write throughput in MBps, 0-40 scale, for raw, SDFS 128k write, SDFS 128k re-write, LessFS 128k write, and LessFS 128k re-write]
Load
(SDFS 128k)
Open Source Dedupe
●   Pro
    ●   Free
    ●   Can be stable, if well managed
●   Con
    ●   Not in repos yet
    ●   Efforts behind them seem very limited, one developer each
    ●   No/Poor documentation
The Future
●   Eventual Commodity?
●   btrfs
    ●   Dedupe planned (off-line only)
Conclusion/Recommendations
●   Dedupe is great, if it works and it meets your
    performance and storage requirements
●   OSS Dedupe has a way to go
●   SDFS/OpenDedupe is the best OSS option right now
●   JungleDisk is good and cheap, but not OSS
About Red Wire Services
If you found this presentation helpful, consider
Red Wire Services for your next
Backup, Archive, or IT Disaster Recovery
Planning project.
Learn more at www.RedWireServices.com
About Nick Webb
Nick Webb is the founder of Red Wire Services, in
Seattle, WA. Nick is available to speak on a variety of IT
Disaster Recovery related topics, including:
●   Preserving Your Digital Legacy
●   Getting Started with your Small Business Disaster
    Recovery Plan
●   Archive Storage for SMBs
If you are interested in having Nick speak to your group, please
call (206) 829-8621 or email info@redwireservices.com

Speaker notes

  1. Different types of deduplication levels: file level, block level, variable block versus fixed block. Quantum/DD use variable blocks.
  2. Pretty much the same as all compression.