Data deDuplication - deDup
deDuplication - English
Deduplikace - Čeština
Deduplikation - Deutsch
Deduplicación - Español
Déduplication - Français
Datadeduplisering - Norsk
Дедупликация - Русский
Дедуплікація - Українська
http://www.text-filter.com/tools/remove-duplicate-lines/
In computing, data deduplication is a specialized data compression
technique for eliminating duplicate copies of repeating data. Related and
somewhat synonymous terms are intelligent (data) compression and single-
instance (data) storage. This technique is used to improve storage
utilization and can also be applied to network data transfers to reduce
the number of bytes that must be sent. In the deduplication process,
unique chunks of data, or byte patterns, are identified and stored during
a process of analysis. As the analysis continues, other chunks are
compared to the stored copy and whenever a match occurs, the redundant
chunk is replaced with a small reference that points to the stored chunk.
Given that the same byte pattern may occur dozens, hundreds, or even
thousands of times (the match frequency is dependent on the chunk size),
the amount of data that must be stored or transferred can be greatly
reduced.[1]
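To make the chunk-and-reference process above concrete, here is a minimal sketch in Python. The fixed 4 KiB chunks, the SHA-256 identifiers and the in-memory dictionary are illustrative assumptions, not any particular product's design.

    import hashlib

    CHUNK_SIZE = 4096          # fixed-size chunking, kept simple for illustration
    store = {}                 # hash -> chunk bytes (the pool of unique chunks)

    def dedup_write(data: bytes) -> list[str]:
        """Split data into chunks, store each unique chunk once, return references."""
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)   # stored only if not already present
            refs.append(digest)               # a small reference replaces the chunk
        return refs

    def dedup_read(refs: list[str]) -> bytes:
        """Reconstitute the original byte stream from the stored references."""
        return b"".join(store[d] for d in refs)

    # Two writes of the same 1 MB payload consume the space of a single chunk.
    payload = b"A" * (1024 * 1024)
    refs1 = dedup_write(payload)
    refs2 = dedup_write(payload)
    assert dedup_read(refs1) == payload
    print(len(store), "unique chunk(s) for", len(refs1) + len(refs2), "references")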
This type of deduplication is different from that performed by standard
file-compression tools, such as LZ77 and LZ78. Whereas these tools
identify short repeated substrings inside individual files, the intent of
storage-based data deduplication is to inspect large volumes of data and
identify large sections – such as entire files or large sections of files
– that are identical, in order to store only one copy. This copy
may be additionally compressed by single-file compression techniques. For
example, a typical email system might contain 100 instances of the same 1
MB (megabyte) file attachment. Each time the email platform is backed up,
all 100 instances of the attachment are saved, requiring 100 MB storage
space. With data deduplication, only one instance of the attachment is
actually stored; the subsequent instances are referenced back to the
saved copy, for a deduplication ratio of roughly 100 to 1.
Storage-based data deduplication reduces the amount of storage needed for
a given set of files. It is most effective in applications where many
copies of very similar or even identical data are stored on a single
disk—a surprisingly common scenario. In the case of data backups, which
routinely are performed to protect against data loss, most data in a
given backup remain unchanged from the previous backup. Common backup
systems try to exploit this by omitting (or hard linking) files that
haven't changed or storing differences between files. Neither approach
captures all redundancies, however. Hard-linking does not help with large
files that have only changed in small ways, such as an email database;
differencing only finds redundancies between adjacent versions of a single file
(consider a section that was deleted and later added in again, or a logo
image included in many documents).
Network data deduplication is used to reduce the number of bytes that
must be transferred between endpoints, which can reduce the amount of
bandwidth required. See WAN optimization for more information.
Virtual servers and virtual desktops benefit from deduplication because
it allows nominally separate system files for each virtual machine to be
coalesced into a single storage space. At the same time, if a given
virtual machine customizes a file, deduplication will not change the
files on the other virtual machines—something that alternatives like hard
links or shared disks do not offer. Backing up or making duplicate copies
of virtual environments is similarly improved.
Deduplication overview
Deduplication may occur "in-line", as data is flowing, or "post-process"
after it has been written.
Post-process deduplication
With post-process deduplication, new data is first stored on the storage
device and then a process at a later time will analyze the data looking
for duplication. The benefit is that there is no need to wait for the
hash calculations and lookup to be completed before storing the data,
thereby ensuring that store performance is not degraded. Implementations
offering policy-based operation can give users the ability to defer
optimization on "active" files, or to process files based on type and
location. One potential drawback is that duplicate data may be
unnecessarily stored for a short time, which can be problematic if the
system is nearing full capacity.
In-line deduplication
Alternatively, deduplication hash calculations can be done synchronously
as data enters the target device. If the storage system identifies a
block which it has already stored, only a reference to the existing block
is stored, rather than the whole new block.
The advantage of in-line deduplication over post-process deduplication is
that it requires less storage, since duplicate data is never stored. On
the negative side, it is frequently argued[by whom?] that because hash
calculations and lookups take so long, data ingestion can be slower,
thereby reducing the backup throughput of the device. However, certain
vendors with in-line deduplication have demonstrated equipment with
similar performance to their post-process deduplication
counterparts.[according to whom?]
Incoming data is first stored in a "lining space" before it reaches the real storage blocks. On SSD-based systems this lining space is provided using NVRAM, which is not cost-efficient.[according to whom?]
Post-process and in-line deduplication methods are often heavily
debated.[2][3]
Data formats
The SNIA Dictionary identifies two methods:
content-agnostic data deduplication - a data deduplication method that does not require awareness of specific application data formats.
content-aware data deduplication - a data deduplication method that leverages knowledge of specific application data formats.
Source versus target deduplication
Another way to classify data deduplication methods is according to where
they occur. Deduplication occurring close to where data is created is often referred to[according to whom?] as "source deduplication". When it
occurs near where the data is stored, it is commonly called "target
deduplication".
Source deduplication ensures that data on the data source is
deduplicated. This generally takes place directly within a file
system.[4][5] The file system will periodically scan new files creating
hashes and compare them to the hashes of existing files. When files with the same hash are found, the duplicate copy is removed and the new file points to the old file. Unlike hard links, however, duplicated files are considered
to be separate entities and if one of the duplicated files is later
modified, then using a system called copy-on-write a copy of that file or
changed block is created. The deduplication process is transparent to the
users and backup applications. Backing up a deduplicated file system will often cause the duplication to reappear, resulting in backups that are larger than the source data.
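As an illustration of the periodic scan described above, the following sketch hashes every file under a directory and records which files could be collapsed into references to an earlier copy. It only builds the mapping; creating the references and handling copy-on-write belong to the file system, and the directory path in the usage line is hypothetical.

    import hashlib
    from pathlib import Path

    def scan_for_duplicates(root: str) -> dict[Path, Path]:
        """Return {duplicate_path: original_path} for files with identical content."""
        first_seen: dict[str, Path] = {}     # content hash -> first file with that hash
        duplicates: dict[Path, Path] = {}
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in first_seen:
                duplicates[path] = first_seen[digest]   # candidate for a reference
            else:
                first_seen[digest] = path
        return duplicates

    # Usage: list the files a deduplicating file system could turn into references.
    for dup, original in scan_for_duplicates("/var/backups").items():
        print(dup, "duplicates", original)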
Target deduplication is the process of removing duplicates when the data was not generated at that location. An example would be a server connected to a SAN/NAS: the SAN/NAS is the target for the server (target deduplication). The server is not aware of any deduplication, even though it is the point where the data is generated.
A second example is backup. Generally the target will be a backup store such as a data repository or a virtual tape library.
Deduplication methods
One of the most common forms of data deduplication implementations works
by comparing chunks of data to detect duplicates. For that to happen,
each chunk of data is assigned an identification, calculated by the
software, typically using cryptographic hash functions. In many
implementations, the assumption is made that if the identification is
identical, the data is identical, even though this cannot be true in all
cases due to the pigeonhole principle; other implementations do not
assume that two blocks of data with the same identifier are identical,
but actually verify that data with the same identification is
identical.[6] If the software either assumes that a given identification
already exists in the deduplication namespace or actually verifies the
identity of the two blocks of data, depending on the implementation, then
it will replace that duplicate chunk with a link.
Once the data has been deduplicated, upon read back of the file, wherever
a link is found, the system simply replaces that link with the referenced
data chunk. The deduplication process is intended to be transparent to
end users and applications.
Commercial deduplication implementations differ by their chunking methods
and architectures.
Chunking. In some systems, chunks are defined by physical layer
constraints (e.g. 4KB block size in WAFL). In some systems only complete
files are compared, which is called single-instance storage or SIS. The
most intelligent (but CPU-intensive) method of chunking is generally considered to be sliding-block chunking, in which a window is passed
along the file stream to seek out more naturally occurring internal file
boundaries.
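A minimal sketch of such a sliding-block (content-defined) chunker is shown below. The window size, rolling-hash parameters and boundary mask are arbitrary illustrative choices, not any standard.

    WINDOW = 48                 # bytes in the sliding window
    BASE = 257
    MOD = (1 << 61) - 1         # large modulus for the rolling hash
    MASK = (1 << 13) - 1        # a boundary roughly every 8 KiB on average
    MIN_CHUNK, MAX_CHUNK = 2048, 65536

    def chunk_boundaries(data: bytes) -> list[int]:
        """Return end offsets of content-defined chunks within data."""
        boundaries = []
        pow_w = pow(BASE, WINDOW - 1, MOD)   # factor for removing the oldest byte
        start, h = 0, 0
        for i, byte in enumerate(data):
            if i - start >= WINDOW:
                h = (h - data[i - WINDOW] * pow_w) % MOD   # drop byte leaving the window
            h = (h * BASE + byte) % MOD                    # add the incoming byte
            length = i - start + 1
            if (length >= MIN_CHUNK and (h & MASK) == MASK) or length >= MAX_CHUNK:
                boundaries.append(i + 1)   # a "naturally occurring" boundary
                start, h = i + 1, 0        # restart the window for the next chunk
        if start < len(data):
            boundaries.append(len(data))   # final partial chunk
        return boundaries

Because boundaries depend on content rather than on fixed offsets, inserting a few bytes near the start of a file shifts only the chunks around the edit, and the later chunks still hash to the same identifiers.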
Client backup deduplication. This is the process where the deduplication
hash calculations are initially created on the source (client) machines.
Files that have identical hashes to files already in the target device
are not sent; the target device just creates appropriate internal links
to reference the duplicated data. The benefit of this is that it avoids
data being unnecessarily sent across the network thereby reducing traffic
load.
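A sketch of that exchange is shown below, with an in-process dictionary standing in for the target device; the class and method names are invented for illustration.

    import hashlib

    class BackupTarget:
        """Stands in for the backup appliance (target device)."""
        def __init__(self):
            self.chunks: dict[str, bytes] = {}

        def missing(self, digests: list[str]) -> set[str]:
            return {d for d in digests if d not in self.chunks}

        def receive(self, chunks: dict[str, bytes]) -> None:
            self.chunks.update(chunks)

    def backup(client_chunks: list[bytes], target: BackupTarget) -> int:
        """Send only chunks the target has never seen; return bytes actually sent."""
        digests = [hashlib.sha256(c).hexdigest() for c in client_chunks]
        needed = target.missing(digests)                        # round 1: hashes only
        payload = {d: c for d, c in zip(digests, client_chunks) if d in needed}
        target.receive(payload)                                 # round 2: new data only
        return sum(len(c) for c in payload.values())

    target = BackupTarget()
    print(backup([b"chunk-a", b"chunk-b"], target))             # 14: both chunks travel
    print(backup([b"chunk-a", b"chunk-b", b"chunk-c"], target)) # 7: only chunk-c travels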
Primary storage and secondary storage. By definition, primary storage
systems are designed for optimal performance, rather than lowest possible
cost. The design criterion for these systems is to increase performance,
at the expense of other considerations. Moreover, primary storage systems
are much less tolerant of any operation that can negatively impact
performance. Also by definition, secondary storage systems contain
primarily duplicate, or secondary copies of data. These copies of data
are typically not used for actual production operations and as a result
are more tolerant of some performance degradation, in exchange for
increased efficiency.
To date, data deduplication has predominantly been used with secondary
storage systems. The reasons for this are two-fold. First, data
deduplication requires overhead to discover and remove the duplicate
data. In primary storage systems, this overhead may impact performance.
The second reason why deduplication is applied to secondary data, is that
secondary data tends to have more duplicate data. Backup applications in
particular commonly generate significant portions of duplicate data over
time.
Data deduplication has been deployed successfully with primary storage in
some cases where the system design does not require significant overhead,
or impact performance.
Drawbacks and concerns
Whenever data is transformed, concerns arise about potential loss of
data. By definition, data deduplication systems store data differently
from how it was written. As a result, users are concerned with the
integrity of their data. The various methods of deduplicating data all
employ slightly different techniques. However, the integrity of the data
will ultimately depend upon the design of the deduplicating system and the quality of its implementation of the algorithms. As the technology has
matured over the past decade, the integrity of most of the major products
has been well proven.[citation needed]
One method for deduplicating data relies on the use of cryptographic hash
functions to identify duplicate segments of data. If two different pieces
of information generate the same hash value, this is known as a
collision. The probability of a collision depends upon the hash function
used, and although the probabilities are small, they are always non-zero.
Thus, the concern arises that data corruption can occur if a hash
collision occurs, and additional means of verification are not used to
verify whether there is a difference in data, or not. Both in-line and
post-process architectures may offer bit-for-bit validation of original
data for guaranteed data integrity.[7] The hash functions used include
standards such as SHA-1, SHA-256 and others.
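For a sense of scale, the usual birthday-bound approximation can be computed directly; the figures below assume an ideal hash and are purely illustrative.

    def collision_probability(n_chunks: int, hash_bits: int) -> float:
        """Approximate P(any two of n chunks share a b-bit hash) ~= n^2 / 2^(b+1)."""
        return n_chunks * (n_chunks - 1) / 2 ** (hash_bits + 1)

    # One billion unique 8 KiB chunks (about 8 PB of data) with SHA-256:
    print(collision_probability(10**9, 256))   # ~4.3e-60: tiny, but non-zero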
The computational resource intensity of the process can be a drawback of
data deduplication. However, this is rarely an issue for stand-alone
devices or appliances, as the computation is completely offloaded from
other systems. This can be an issue when the deduplication is embedded
within devices providing other services. To improve performance, many
systems utilize both weak and strong hashes. Weak hashes are much faster
to calculate but there is a greater risk of a hash collision. Systems
that utilize weak hashes will subsequently calculate a strong hash and
will use it as the determining factor for whether it is actually the same
data or not. Note that the system overhead associated with calculating
and looking up hash values is primarily a function of the deduplication
workflow. The reconstitution of files does not require this processing
and any incremental performance penalty associated with re-assembly of
data chunks is unlikely to impact application performance.
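A sketch of that weak-then-strong check is given below, using zlib.adler32 as a stand-in for the weak hash and SHA-256 as the strong hash; the choice of functions is an assumption for illustration.

    import hashlib
    import zlib

    weak_index: dict[int, list[bytes]] = {}   # weak hash -> stored chunks sharing it

    def is_duplicate(chunk: bytes) -> bool:
        """Return True if an identical chunk is already stored; otherwise store it."""
        weak = zlib.adler32(chunk)
        candidates = weak_index.setdefault(weak, [])
        for stored in candidates:
            # A weak-hash match only suggests equality; confirm with the strong hash.
            if hashlib.sha256(stored).digest() == hashlib.sha256(chunk).digest():
                return True
        candidates.append(chunk)              # new unique chunk
        return False

    print(is_duplicate(b"same bytes"))   # False: first occurrence is stored
    print(is_duplicate(b"same bytes"))   # True: weak and strong hashes both match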
Another area of concern with deduplication is the related effect on
snapshots, backup, and archival, especially where deduplication is
applied against primary storage (for example inside a NAS filer).[further
explanation needed] Reading files out of a storage device causes full
reconstitution of the files (also known as rehydration), so any secondary
copy of the data set is likely to be larger than the primary copy. In
terms of snapshots, if a file is snapshotted prior to deduplication, the
post-deduplication snapshot will preserve the entire original file. This
means that although storage capacity for primary file copies will shrink,
capacity required for snapshots may expand dramatically.
Another concern is the effect of compression and encryption. Although
deduplication is a version of compression, it works in tension with
traditional compression. Deduplication achieves better efficiency against
smaller data chunks, whereas compression achieves better efficiency
against larger chunks. The goal of encryption is to eliminate any
discernible patterns in the data. Thus encrypted data cannot be
deduplicated, even though the underlying data may be redundant.
Deduplication ultimately reduces redundancy. If this is not expected and planned for, it may undermine the underlying reliability of the system.
(Compare this, for example, to the LOCKSS storage architecture that
achieves reliability through multiple copies of data.)
Scaling has also been a challenge for deduplication systems because
ideally, the scope of deduplication needs to be shared across storage
devices. If there are multiple disk backup devices in an infrastructure
with discrete deduplication, then space efficiency is adversely affected.
A deduplication shared across devices preserves space efficiency, but is
technically challenging from a reliability and performance
perspective.[citation needed]
Although not a shortcoming of data deduplication, there have been data
breaches[citation needed] when insufficient security and access
validation procedures are used with large repositories of deduplicated
data. In some systems, as typical with cloud storage,[citation needed] an
attacker can retrieve data owned by others by knowing or guessing the
hash value of the desired data.[8]
How does Dedup work?
The data deduplication process works by eliminating redundant data and
ensuring that only the first unique instance of any data is actually
retained. Subsequent iterations of the data are replaced with a pointer
to the original. Data deduplication can operate at the file, block or bit
level.
----------------------------------------------------------
Deduplicación de datos - Español
In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeated data. A term related to data deduplication is intelligent data compression. This technique is used to optimize data storage on disk and also to reduce the amount of information that must be sent from one device to another over communication networks.
One application of deduplication is to reduce the amount of data when creating backup copies of large systems.
----------------------------------------------------------
Déduplication - Français
In computing, deduplication (also called factoring or single-instance storage) is a data storage technique that consists of factoring out identical sequences of data in order to save the space used.
Each file is split into a multitude of chunks. Each of these chunks is assigned a unique identifier, and these identifiers are stored in an index. The goal of deduplication is to store a given chunk only once. A new occurrence of a chunk that is already present is therefore not saved again, but replaced by a pointer to the corresponding identifier.
Deduplication is used in particular in solutions of the VTL (Virtual Tape Library) type or any other kind of backup system.
Deduplication methods
Off-line deduplication
The data to be backed up is first copied to a buffer disk area, and in a second step a search for duplicate blocks is carried out. This method requires a large amount of storage space. It is the principle behind the Falconstor and Quantum DXi (firmware 1.x) solutions, for example.
In-line deduplication
The data to be backed up is analyzed "on the fly", and an index table of identical blocks is maintained (the NexentaStor solution from Nexenta Systems, Data Domain from EMC Corporation or IBM ProtecTIER).[1]
Source deduplication
Agents distributed across the servers to be backed up analyze the data at the source (notably the EMC Avamar solution).[1]
Principle
The index created during the backup is used to restore the data to the right place. Files or blocks recorded as duplicates in the index are duplicated again at restore time. Experience shows that in practice the deduplication ratio increases over time, because little data changes between two full backups. The reduction ratio obtained also depends strongly on the type of data processed.[2]
Drawbacks of deduplication
Risk of data loss, since the data is not stored in duplicate; the medium used must therefore be reliable. The reduction in backup size is an advantage over other types of backup, but at the expense of data safety. It is therefore recommended to create duplicates of the storage media.
Loss of the original format, which in some cases raises problems of compliance with legal constraints (for example Basel II). For this reason, some solutions offer to write sensitive data to cartridge in its original format, to guard against a possible failure of the VTL, for example.
Advantages of deduplication
The most important advantage is the reduction in the space occupied by backups: according to the Gartner research firm, this technology can reduce storage space requirements by a factor of 20 or even 30.[3]
An indirect advantage, a consequence of the previous one, is the reduction in the bandwidth needed for backup in the case of source deduplication.
----------------------------------------------------------
Deduplikation - Deutsch
Deduplication (from English "deduplication"), also data deduplication or de-duplication, is in information technology a process that identifies redundant data (duplicate detection) and eliminates it before it is written to a non-volatile storage medium. Like other techniques, the process compresses the amount of data that is sent from a sender to a receiver. It is nearly impossible to predict the efficiency of deduplication algorithms, since it always depends on the data structure and the rate of change. Deduplication can be a very efficient way of reducing data volumes in which pattern recognition is possible (unencrypted data).
For the time being, the primary field of application of deduplication is backup, where in practice it usually achieves stronger data compression than other methods. The technique is fundamentally suitable for any use case in which data is repeatedly copied.
How it works
Deduplication systems divide files into blocks of equal size (usually powers of two) and compute a checksum for each block. This is also what distinguishes them from single-instance storage (SIS), which is meant to eliminate identical files (see also content-addressable storage, CAS).
All checksums are then stored together with a reference to the corresponding file and the position within the file. When a new file is added, its content is likewise divided into blocks and the checksums are computed from them. It is then checked whether a checksum already exists. This comparison of checksums is much faster than comparing the file contents directly. If an identical checksum is found, this is an indication that an identical data block may have been found; however, it must still be verified that the contents really are identical, since it could also be a collision.
If an identical data block has been found, one of the blocks is removed and only a reference to the other data block is stored instead. This reference requires less storage space than the block itself.
There are two methods for selecting the blocks. With "reverse referencing" the first common block is stored, and all further identical blocks receive a reference to the first. "Forward referencing" always stores the most recently encountered common data block and references the previously encountered ones. This methodological dispute comes down to whether data should be stored faster or restored faster. Further approaches, such as "inband" and "outband", compete over whether the data stream is analyzed "on the fly", i.e. during operation, or only after it has been stored at the destination. In the first case only one data stream may exist; in the second, the data can be examined in parallel by means of several data streams.
Example
When backing up hard disks to tape media, the ratio of new or changed data to unchanged data between two full backups is usually only relatively small. With classic backup, however, two full backups still require at least twice the storage capacity on tape compared to the original data. Deduplication recognizes the identical data components. Unique segments are recorded in a list, and when such a piece of data reoccurs, its time and place in the data stream are noted, so that the original data can ultimately be restored. However, the backups are then no longer independent full backups, which means that the loss of one version leads to irretrievable data loss. Like incremental backups, deduplication thus trades data safety for lower storage requirements.
Chunking (Fingerprinting)
Fingerprinting also tries to determine how the incoming data stream can best be split into pieces so that as many identical data blocks as possible result. This process is called chunking (from English "chunk": piece, block).
Identification of blocks
The more precisely the changes to a file can be determined, the less has to be stored redundantly. However, this enlarges the index, i.e. the blueprint describing how and from which components the file is reassembled when it is accessed. The method used to identify common blocks is therefore also decisive.
----------------------------------------------------------
Дедупликация - Русский
[Russian-language section; the Cyrillic text was lost to a character-encoding error. The surviving fragments show that it covered the same ground as the English article: deduplication (from Latin deduplicatio), the division of data into chunks, and the distinction from conventional compression algorithms such as LZ77 and LZO.]
----------------------------------------------------------
Datadeduplisering - Norsk
Data deduplication is, in informatics, the name of a specialized data compression technique that eliminates duplicate copies of repeating data. Related and partly synonymous terms are intelligent (data) compression and single-instance storage. The technique is used to improve the utilization of storage media and can also be applied to data transfer in a network to reduce the number of bytes that must be sent. During deduplication, unique quantities of data, or byte patterns, are identified and stored in a process of analysis. As the analysis continues, other quantities of data are compared with the stored copy, and when a match occurs, the redundant quantity of data is replaced with a small reference that points to the stored data. Since the same quantity of data can occur dozens, hundreds or thousands of times, the amount of data stored or transferred can be reduced considerably.
----------------------------------------------------------
Дедуплікація - Українська
[Ukrainian-language section; the Cyrillic text was lost to a character-encoding error. The surviving fragments show that it discussed deduplication (from Latin deduplicatio), its implementation in ZFS[1][3] and IPFS[2], its interaction with snapshots and the RLE algorithm, the GNU fdupes utility for finding duplicate files, and block-identification hash functions such as SHA-1, SHA-256, MD5 and fletcher4.[4]]
----------------------------------------------------------
Deduplikace - Čeština
Simplified diagram of deduplication
Deduplication is a special data compression technique that prevents identical blocks of data from being stored more than once on a single storage system. The deduplication unit stores information (reference information) about the data structure and is therefore able, when deduplicated data is read back, to reconstruct the original, complete information. The purpose of deduplication is to save space on the storage system. Besides this variant, so-called block-level deduplication, there is also file-level deduplication, in which only one copy (instance) of a file or e-mail attachment is stored. Examples are the storage of e-mail messages in Microsoft Exchange[1] or Windows Single Instance Storage[2].
deDuplication - English
Deduplikace - Čeština
Deduplikation - Deutsch
Deduplicación - Español
Déduplication - Français
Datadeduplisering - Norsk
Дедупликация - Русский
Дедуплікація - Українська
DeDuplication, Online - Text Tools
http://www.text-filter.com/tools
Weitere ähnliche Inhalte

Was ist angesagt?

ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESneirew J
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Samsung Business USA
 
Distributed processing
Distributed processingDistributed processing
Distributed processingNeil Stein
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...Alexander Decker
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic WebIrina Hutanu
 
A hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplicationA hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplicationTmks Infotech
 
Hadoop file system
Hadoop file systemHadoop file system
Hadoop file systemJohn Veigas
 
A request skew aware heterogeneous distributed
A request skew aware heterogeneous distributedA request skew aware heterogeneous distributed
A request skew aware heterogeneous distributedJoão Gabriel Lima
 
Deduplication on Encrypted Big Data in HDFS
Deduplication on Encrypted Big Data in HDFSDeduplication on Encrypted Big Data in HDFS
Deduplication on Encrypted Big Data in HDFSIRJET Journal
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperDavid Walker
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiUnmesh Baile
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Survey of clustered_parallel_file_systems_004_lanl.ppt
Survey of clustered_parallel_file_systems_004_lanl.pptSurvey of clustered_parallel_file_systems_004_lanl.ppt
Survey of clustered_parallel_file_systems_004_lanl.pptRohn Wood
 
Nas and san
Nas and sanNas and san
Nas and sansongaco
 

Was ist angesagt? (18)

ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
 
Distributed processing
Distributed processingDistributed processing
Distributed processing
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic Web
 
A hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplicationA hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplication
 
Hadoop file system
Hadoop file systemHadoop file system
Hadoop file system
 
A request skew aware heterogeneous distributed
A request skew aware heterogeneous distributedA request skew aware heterogeneous distributed
A request skew aware heterogeneous distributed
 
Cloud storage
Cloud storageCloud storage
Cloud storage
 
Deduplication on Encrypted Big Data in HDFS
Deduplication on Encrypted Big Data in HDFSDeduplication on Encrypted Big Data in HDFS
Deduplication on Encrypted Big Data in HDFS
 
Hdfs
HdfsHdfs
Hdfs
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - Paper
 
SiDe Enabled Reliable Replica Optimization
SiDe Enabled Reliable Replica OptimizationSiDe Enabled Reliable Replica Optimization
SiDe Enabled Reliable Replica Optimization
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbai
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Survey of clustered_parallel_file_systems_004_lanl.ppt
Survey of clustered_parallel_file_systems_004_lanl.pptSurvey of clustered_parallel_file_systems_004_lanl.ppt
Survey of clustered_parallel_file_systems_004_lanl.ppt
 
Nas and san
Nas and sanNas and san
Nas and san
 
Storage
StorageStorage
Storage
 

Ähnlich wie Deduplication - Remove Duplicate

Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...IRJET Journal
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 
Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2Rishikesh Pathak
 
Scalable and adaptive data replica placement for geo distributed cloud storages
Scalable and adaptive data replica placement for geo distributed cloud storagesScalable and adaptive data replica placement for geo distributed cloud storages
Scalable and adaptive data replica placement for geo distributed cloud storagesVenkat Projects
 
Survey of distributed storage system
Survey of distributed storage systemSurvey of distributed storage system
Survey of distributed storage systemZhichao Liang
 
[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage ReductionPerforce
 
a hybrid cloud approach for secure authorized
a hybrid cloud approach for secure authorizeda hybrid cloud approach for secure authorized
a hybrid cloud approach for secure authorizedlogicsystemsprojects
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...dbpublications
 
Google File System
Google File SystemGoogle File System
Google File Systemvivatechijri
 
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPEVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPijdms
 
Doc A hybrid cloud approach for secure authorized deduplication
 Doc A hybrid cloud approach for secure authorized deduplication Doc A hybrid cloud approach for secure authorized deduplication
Doc A hybrid cloud approach for secure authorized deduplicationShakas Technologie
 
Compare Array vs Host vs Hypervisor vs Network-Based Replication
Compare Array vs Host vs Hypervisor vs Network-Based ReplicationCompare Array vs Host vs Hypervisor vs Network-Based Replication
Compare Array vs Host vs Hypervisor vs Network-Based ReplicationMaryJWilliams2
 
Distributed database
Distributed databaseDistributed database
Distributed databasesanjay joshi
 

Ähnlich wie Deduplication - Remove Duplicate (20)

Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Distributed file systems dfs
Distributed file systems   dfsDistributed file systems   dfs
Distributed file systems dfs
 
Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2
 
Document 22.pdf
Document 22.pdfDocument 22.pdf
Document 22.pdf
 
Scalable and adaptive data replica placement for geo distributed cloud storages
Scalable and adaptive data replica placement for geo distributed cloud storagesScalable and adaptive data replica placement for geo distributed cloud storages
Scalable and adaptive data replica placement for geo distributed cloud storages
 
Survey of distributed storage system
Survey of distributed storage systemSurvey of distributed storage system
Survey of distributed storage system
 
[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction
 
Operating system
Operating systemOperating system
Operating system
 
a hybrid cloud approach for secure authorized
a hybrid cloud approach for secure authorizeda hybrid cloud approach for secure authorized
a hybrid cloud approach for secure authorized
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
 
Google File System
Google File SystemGoogle File System
Google File System
 
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPEVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
 
Doc A hybrid cloud approach for secure authorized deduplication
 Doc A hybrid cloud approach for secure authorized deduplication Doc A hybrid cloud approach for secure authorized deduplication
Doc A hybrid cloud approach for secure authorized deduplication
 
Compare Array vs Host vs Hypervisor vs Network-Based Replication
Compare Array vs Host vs Hypervisor vs Network-Based ReplicationCompare Array vs Host vs Hypervisor vs Network-Based Replication
Compare Array vs Host vs Hypervisor vs Network-Based Replication
 
Distributed database
Distributed databaseDistributed database
Distributed database
 
Data replication
Data replicationData replication
Data replication
 

Kürzlich hochgeladen

This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 

Kürzlich hochgeladen (20)

This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 

Deduplication - Remove Duplicate

  • 1.
  • 2. Data deDuplication - deDup deDuplication - English Deduplikace - Čeština Deduplikation - Deutsch Deduplicación Español Déduplication - Français Datadeduplisering - Norsk Д - Д - ttp://www.text-filter.com/tools/remove-duplicate-lines/ In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are intelligent (data) compression and single- instance (data) storage. This technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.[1] This type of deduplication is different from that performed by standard file-compression tools, such as LZ77 and LZ78. Whereas these tools identify short repeated substrings inside individual files, the intent of storage-based data deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, in order to store only one copy of it. This copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy for deduplication ratio of roughly 100 to 1. Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on a single disk—a surprisingly common scenario. In the case of data backups, which routinely are performed to protect against data loss, most data in a given backup remain unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard linking) files that haven't changed or storing differences between files. Neither approach captures all redundancies, however. Hard-linking does not help with large files that have only changed in small ways, such as an email database; differences only find redundancies in adjacent versions of a single file (consider a section that was deleted and later added in again, or a logo image included in many documents).
  • 3. Network data deduplication is used to reduce the number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information. Virtual servers and virtual desktops benefit from deduplication because it allows nominally separate system files for each virtual machine to be coalesced into a single storage space. At the same time, if a given virtual machine customizes a file, deduplication will not change the files on the other virtual machines—something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved. Deduplication overview Deduplication may occur "in-line", as data is flowing, or "post-process" after it has been written. Post-process deduplication With post-process deduplication, new data is first stored on the storage device and then a process at a later time will analyze the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, thereby ensuring that store performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on "active" files, or to process files based on type and location. One potential drawback is that duplicate data may be unnecessarily stored for a short time, which can be problematic if the system is nearing full capacity. In-line deduplication Alternatively, deduplication hash calculations can be done synchronized as data enters the target device. If the storage system identifies a block which it has already stored, only a reference to the existing block is stored, rather than the whole new block. The advantage of in-line deduplication over post-process deduplication is that it requires less storage, since duplicate data is never stored. On the negative side, it is frequently argued[by whom?] that because hash calculations and lookups take so long, data ingestion can be slower, thereby reducing the backup throughput of the device. However, certain vendors with in-line deduplication have demonstrated equipment with similar performance to their post-process deduplication counterparts.[according to whom?] Data coming in is stored into "lining space" before it hits real storage blocks. On SSD disks lining space is provided using NVRAM which is not cost-efficient.[according to whom?] Post-process and in-line deduplication methods are often heavily debated.[2][3] Data formats SNIA Dictionary identifies two methods: content-agnostic data deduplication - a data deduplication method that does not require awareness of specific application data formats. content-aware data deduplication - a data deduplication method that leverages knowledge of specific application data formats. Source versus target deduplication Another way to classify data deduplication methods is according to where they occur. Deduplication occurring close to where data is created, is
  • 4. often referred to[according to whom?] as "source deduplication". When it occurs near where the data is stored, it is commonly called "target deduplication". Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file system.[4][5] The file system will periodically scan new files creating hashes and compare them to hashes of existing files. When files with same hashes are found then the file copy is removed and the new file points to the old file. Unlike hard links however, duplicated files are considered to be separate entities and if one of the duplicated files is later modified, then using a system called copy-on-write a copy of that file or changed block is created. The deduplication process is transparent to the users and backup applications. Backing up a deduplicated file system will often cause duplication to occur resulting in the backups being bigger than the source data. Target deduplication is the process of removing duplicates when the data was not generated at that location. Example of this would be a server connected to a SAN/NAS, The SAN/NAS would be a target for the server (Target deduplication). The server is not aware of any deduplication, the server is also the point of data generation. A second example would be backup. Generally this will be a backup store such as a data repository or a virtual tape library. Deduplication methods One of the most common forms of data deduplication implementations works by comparing chunks of data to detect duplicates. For that to happen, each chunk of data is assigned an identification, calculated by the software, typically using cryptographic hash functions. In many implementations, the assumption is made that if the identification is identical, the data is identical, even though this cannot be true in all cases due to the pigeonhole principle; other implementations do not assume that two blocks of data with the same identifier are identical, but actually verify that data with the same identification is identical.[6] If the software either assumes that a given identification already exists in the deduplication namespace or actually verifies the identity of the two blocks of data, depending on the implementation, then it will replace that duplicate chunk with a link. Once the data has been deduplicated, upon read back of the file, wherever a link is found, the system simply replaces that link with the referenced data chunk. The deduplication process is intended to be transparent to end users and applications. Commercial deduplication implementations differ by their chunking methods and architectures. Chunking. In some systems, chunks are defined by physical layer constraints (e.g. 4KB block size in WAFL). In some systems only complete files are compared, which is called single-instance storage or SIS. The most intelligent (but CPU intensive) method to chunking is generally considered to be sliding-block. In sliding block, a window is passed along the file stream to seek out more naturally occurring internal file boundaries. Client backup deduplication. This is the process where the deduplication hash calculations are initially created on the source (client) machines. Files that have identical hashes to files already in the target device are not sent, the target device just creates appropriate internal links to reference the duplicated data. The benefit of this is that it avoids
Client backup deduplication. This is the process where the deduplication hash calculations are initially created on the source (client) machines. Files that have hashes identical to files already in the target device are not sent; the target device just creates appropriate internal links to reference the duplicated data. The benefit of this is that it avoids data being unnecessarily sent across the network, thereby reducing the traffic load.
Primary storage and secondary storage. By definition, primary storage systems are designed for optimal performance rather than lowest possible cost. The design criterion for these systems is to increase performance, at the expense of other considerations. Moreover, primary storage systems are much less tolerant of any operation that can negatively impact performance. Also by definition, secondary storage systems contain primarily duplicate, or secondary, copies of data. These copies of data are typically not used for actual production operations and as a result are more tolerant of some performance degradation in exchange for increased efficiency.
To date, data deduplication has predominantly been used with secondary storage systems. The reasons for this are twofold. First, data deduplication requires overhead to discover and remove the duplicate data; in primary storage systems, this overhead may impact performance. The second reason why deduplication is applied to secondary data is that secondary data tends to contain more duplicates. Backup applications in particular commonly generate significant portions of duplicate data over time. Data deduplication has been deployed successfully with primary storage in some cases where the system design does not require significant overhead or impact performance.

Drawbacks and concerns
Whenever data is transformed, concerns arise about potential loss of data. By definition, data deduplication systems store data differently from how it was written. As a result, users are concerned about the integrity of their data. The various methods of deduplicating data all employ slightly different techniques. However, the integrity of the data will ultimately depend upon the design of the deduplicating system and the quality of the implementation of its algorithms. As the technology has matured over the past decade, the integrity of most of the major products has been well proven.[citation needed]
One method for deduplicating data relies on the use of cryptographic hash functions to identify duplicate segments of data. If two different pieces of information generate the same hash value, this is known as a collision. The probability of a collision depends upon the hash function used, and although the probabilities are small, they are always non-zero. Thus, the concern arises that data corruption can occur if a hash collision occurs and additional means of verification are not used to check whether the data actually differs. Both in-line and post-process architectures may offer bit-for-bit validation of original data for guaranteed data integrity.[7] The hash functions used include standards such as SHA-1, SHA-256 and others.
The computational resource intensity of the process can be a drawback of data deduplication. However, this is rarely an issue for stand-alone devices or appliances, as the computation is completely offloaded from other systems. It can be an issue when the deduplication is embedded within devices providing other services. To improve performance, many systems utilize both weak and strong hashes. Weak hashes are much faster to calculate, but there is a greater risk of a hash collision. Systems that utilize weak hashes will subsequently calculate a strong hash and use it as the determining factor of whether the data is actually the same or not.
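That two-stage strategy can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular product: CRC-32 (from Python's zlib) stands in for the weak hash, SHA-256 for the strong one, and an optional byte-for-byte comparison guards even against a strong-hash collision.

import hashlib
import zlib

class VerifyingChunkStore:
    """A cheap weak hash acts as a prefilter; a strong hash (and, optionally,
    a byte-for-byte comparison) is the deciding factor for duplicates."""

    def __init__(self, verify_bytes=True):
        self.verify_bytes = verify_bytes
        self.by_weak = {}   # crc32 -> list of chunk ids (weak collisions are expected)
        self.strong = []    # chunk id -> sha256 digest of the stored chunk
        self.chunks = []    # chunk id -> stored bytes

    def store(self, chunk):
        weak = zlib.crc32(chunk)
        candidates = self.by_weak.get(weak, [])
        if candidates:                               # only now pay for the strong hash
            digest = hashlib.sha256(chunk).digest()
            for cid in candidates:
                if self.strong[cid] == digest and (
                        not self.verify_bytes or self.chunks[cid] == chunk):
                    return cid                       # confirmed duplicate: reuse it
        cid = len(self.chunks)                       # genuinely new data: keep it
        self.chunks.append(chunk)
        self.strong.append(hashlib.sha256(chunk).digest())
        self.by_weak.setdefault(weak, []).append(cid)
        return cid

In this sketch, verify_bytes=True corresponds to the bit-for-bit validation mentioned above; turning it off trades that guarantee for speed.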
Note that the system overhead associated with calculating and looking up hash values is primarily a function of the deduplication workflow. The reconstitution of files does not require this processing
and any incremental performance penalty associated with re-assembly of data chunks is unlikely to impact application performance.
Another area of concern with deduplication is the related effect on snapshots, backup, and archival, especially where deduplication is applied against primary storage (for example inside a NAS filer).[further explanation needed] Reading files out of a storage device causes full reconstitution of the files (also known as rehydration), so any secondary copy of the data set is likely to be larger than the primary copy. In terms of snapshots, if a file is snapshotted prior to deduplication, the post-deduplication snapshot will preserve the entire original file. This means that although storage capacity for primary file copies will shrink, the capacity required for snapshots may expand dramatically.
Another concern is the effect of compression and encryption. Although deduplication is a version of compression, it works in tension with traditional compression: deduplication achieves better efficiency against smaller data chunks, whereas compression achieves better efficiency against larger chunks. The goal of encryption is to eliminate any discernible patterns in the data, so encrypted data cannot be deduplicated, even though the underlying data may be redundant.
Deduplication ultimately reduces redundancy. If this was not expected and planned for, it may ruin the underlying reliability of the system. (Compare this, for example, to the LOCKSS storage architecture, which achieves reliability through multiple copies of data.)
Scaling has also been a challenge for deduplication systems, because ideally the scope of deduplication needs to be shared across storage devices. If there are multiple disk backup devices in an infrastructure with discrete deduplication, then space efficiency is adversely affected. Deduplication shared across devices preserves space efficiency, but is technically challenging from a reliability and performance perspective.[citation needed]
Although not a shortcoming of data deduplication, there have been data breaches[citation needed] when insufficient security and access validation procedures are used with large repositories of deduplicated data. In some systems, as is typical with cloud storage,[citation needed] an attacker can retrieve data owned by others by knowing or guessing the hash value of the desired data.[8]

How does Dedup work?
The data deduplication process works by eliminating redundant data and ensuring that only the first unique instance of any data is actually retained. Subsequent iterations of the data are replaced with a pointer to the original. Data deduplication can operate at the file, block or bit level.
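A file-level (single-instance) variant of this process can be sketched in a few lines. The class below is a toy model, not a real filesystem; the file names and the SHA-256 pointer scheme are assumptions for the example. Storing a file retains only the first copy of each distinct content and records a pointer per logical name; reading a name follows the pointer back to the retained copy, which corresponds to the rehydration step discussed above. A block-level implementation applies the same idea to smaller units.

import hashlib

class SingleInstanceStore:
    """File-level deduplication: one physical copy per distinct content,
    and a pointer (the content hash) for every logical file name."""

    def __init__(self):
        self.blobs = {}     # content hash -> the single retained copy
        self.pointers = {}  # file name -> content hash

    def put(self, name, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # only the first instance is retained
        self.pointers[name] = digest            # later instances become pointers

    def get(self, name):
        # "rehydration": follow the pointer back to the retained copy
        return self.blobs[self.pointers[name]]

store = SingleInstanceStore()
store.put("reports/2011/q3.doc", b"...same bytes...")
store.put("archive/q3-copy.doc", b"...same bytes...")
assert len(store.blobs) == 1
assert store.get("archive/q3-copy.doc") == b"...same bytes..."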
----------------------------------------------------------
Deduplicación de datos - Español
In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeated data. A related term is intelligent data compression. This technique is used to optimize data storage on disk and also to reduce the amount of information that must be sent from one device to another across communication networks. One application of deduplication is to reduce the amount of data involved in creating backup copies of large systems.
----------------------------------------------------------
Déduplication - Français
In computing, deduplication (also called factorization or single-instance storage) is a data storage technique that consists of factoring out identical sequences of data in order to save the space used.
Each file is split into a multitude of chunks. Each of these chunks is assigned a unique identifier, and these identifiers are stored in an index. The objective of deduplication is to store a given chunk only once. A new occurrence of a chunk that is already present is therefore not saved again, but replaced by a pointer to the corresponding identifier.
Deduplication is used in particular in VTL (Virtual Tape Library) solutions and in any other type of backup system.

Deduplication methods
Off-line deduplication: The data to be backed up is copied to a buffer disk area and, in a second step, a search for duplicate blocks is carried out. This method requires a large amount of storage space. It is the principle of the Falconstor solutions or of the Quantum DXi with 1.x firmware, for example.
In-line deduplication: The data to be backed up is analyzed "on the fly", and an index table of identical blocks is maintained (for example the NexentaStor solution from Nexenta Systems, Data Domain from EMC Corporation, or IBM ProtecTIER).[1]
Source deduplication: Agents distributed across the servers to be backed up analyze the data at the source (notably the EMC Avamar solution).[1]

Principle
The index created during the backup is used to restore the data to the right place. Files or blocks that are duplicated in the index are duplicated again at restore time. Experience shows that in practice the deduplication ratio increases over time, because little data actually changes between two full backups. The reduction ratio obtained also depends strongly on the type of data processed.[2]

Drawbacks of deduplication
There is a risk of data loss, because the data is no longer held in duplicate, so the medium used must be reliable. The reduction in backup size is an advantage over other types of backup, but it comes at the expense of data safety; it is therefore recommended to create duplicates of the storage media. The original format is also lost, which in some cases raises problems of compliance with legal requirements (for example Basel II). Some solutions therefore offer to write sensitive data to cartridge in its original format, to guard against a possible failure of the VTL, for example.
Advantage of deduplication
The most important advantage is the reduction in the space occupied by backups: according to the analyst firm Gartner, this technology can divide storage space requirements by 20 or even 30.[3] An indirect advantage, a consequence of the previous one, is the reduction in the bandwidth needed for backup in the case of source deduplication.
----------------------------------------------------------
Deduplikation - Deutsch
Deduplication (German Deduplikation, from the English deduplication), also called data deduplication or Deduplizierung, is a process in information technology that identifies redundant data (duplicate detection) and eliminates it before it is written to non-volatile storage. Like other techniques, the process compresses the amount of data that is sent from a sender to a receiver. It is almost impossible to predict the efficiency of deduplication algorithms, since it always depends on the data structure and the rate of change. Deduplication can be a very efficient way of reducing data volumes wherever pattern recognition is possible (unencrypted data). The primary field of application for deduplication is, for the time being, backup, where in practice it usually achieves stronger data reduction than other methods. The technique is fundamentally suitable for any use case in which data is copied repeatedly.

How it works
Deduplication systems divide files into blocks of equal size (usually powers of two) and compute a checksum for each block. This is also what distinguishes the approach from single-instance storage (SIS), which is meant to eliminate identical files (see also content-addressed storage, CAS). All checksums are then stored together with a reference to the corresponding file and the position within the file. When a new file is added, its content is likewise divided into blocks and checksums are computed from them. A comparison is then made to see whether a checksum already exists. Comparing checksums is much faster than comparing the file contents directly. If an identical checksum is found, this is an indication that an identical data block may have been found; it must still be checked whether the contents are actually identical, however, since it may also be a collision. If an identical data block has been found, one of the blocks is removed and only a reference to the other data block is stored instead. This reference requires less storage space than the block itself.
There are two methods for selecting which block is retained. With "reverse referencing", the first common block encountered is stored, and all further identical blocks receive a reference to this first one. "Forward referencing" always stores the most recently encountered common data block and re-references the elements that occurred earlier. This methodological debate comes down to whether data should be stored faster or restored faster. Further approaches, such as "in-band" and "out-of-band", compete over whether the data stream is analyzed "on the fly", i.e. during operation, or only after it has been stored at the destination. In the first case only a single data stream may exist; in the second, the data can be examined in parallel using several data streams.
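The reverse- versus forward-referencing choice can be made concrete with a small sketch. This is an illustrative model only, with made-up names: every logical reference resolves through a table keyed by the block checksum, so the policy merely decides which physical copy that table points at. Reverse referencing keeps the first copy ever written; forward referencing re-points the table at the newest copy, which on real media tends to favour restoring the most recent backup at the cost of extra work at write time.

import hashlib

class BlockStore:
    """Fixed-size-block deduplication with a selectable referencing policy."""

    def __init__(self, policy="reverse", block_size=4096):
        assert policy in ("reverse", "forward")
        self.policy = policy
        self.block_size = block_size
        self.location = {}   # checksum -> index of the physical copy currently retained
        self.physical = []   # physical block storage
        self.streams = {}    # stream name -> list of checksums (the index, or "recipe")

    def save(self, name, data):
        recipe = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.location:
                self.location[digest] = len(self.physical)   # first occurrence is kept
                self.physical.append(block)
            elif self.policy == "forward":
                # forward referencing: the newest occurrence becomes the retained copy;
                # earlier references now resolve to it through the table (the superseded
                # copy would be reclaimed by garbage collection in a real system)
                self.location[digest] = len(self.physical)
                self.physical.append(block)
            recipe.append(digest)
        self.streams[name] = recipe

    def restore(self, name):
        return b"".join(self.physical[self.location[d]] for d in self.streams[name])

In this in-memory toy both policies restore identical bytes; the difference shows up only in which physical copy is retained and in how much work the write path performs.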
Example
When backing up hard disks to tape media, the ratio of new or changed data to unchanged data between two full backups is usually relatively small. With classic backup, two full backups nevertheless require at least twice the storage capacity on tape compared to the original data. Deduplication recognizes the identical data components. Unique segments are recorded in a list, and when such a piece of data occurs again, its time and position in the data stream are noted, so that the original data can ultimately be reconstructed. However, the backups are then no longer independent full backups, which means that the loss of one version leads to irrecoverable data loss. Deduplication thus, like incremental backups, trades some data safety for reduced storage requirements.

Chunking (fingerprinting)
Fingerprinting also tries to determine how the incoming data stream can best be broken into pieces so that as many identical data blocks as possible are produced. This process is called chunking (from the English chunk, 'piece', 'block').

Identification of blocks
The more precisely the changes to a file can be determined, the less has to be stored redundantly. However, this increases the size of the index, i.e. the blueprint describing how and from which components the file is reassembled when it is read. The method used to identify common blocks is therefore also decisive.
----------------------------------------------------------
Дедупликация - Русский
----------------------------------------------------------
Datadeduplisering - Norsk
Within computing, data deduplication is the name of a specialized data compression technique that eliminates duplicate copies of repeating data. Related and partly synonymous terms are intelligent (data) compression and single-instance storage. The technique is used to improve the utilization of storage media and can also be applied to data transfers in a network in order to reduce the number of bytes that must be sent. During deduplication, unique quantities of data, or byte patterns, are identified and stored in a process of analysis. As the analysis continues, other quantities of data are compared with the stored copy, and when a match occurs the redundant data is replaced with a small reference that points to the stored data. Since the same data can occur dozens, hundreds or thousands of times, the amount of data that has to be stored or transferred can be reduced considerably.
----------------------------------------------------------
Дедуплікація - Українська
----------------------------------------------------------
Deduplikace - Čeština
Simplified diagram of deduplication
Deduplication is a special data compression technique that prevents identical blocks of data from being stored more than once on a single storage system. The deduplication unit stores information (reference information) about the data structure and is therefore able, when the deduplicated data is read back, to restore the original, complete information. The purpose of deduplication is to save space on the data storage.
Besides this variant, so-called block-level deduplication, there is also file-level deduplication, in which only one copy (instance) of a file or e-mail attachment is stored. Examples include the storage of e-mail messages in Microsoft Exchange[1] or Windows Single Instance Storage[2].
Deduplication - English
Deduplikace - Čeština
Deduplikation - Deutsch
Deduplicación - Español
Déduplication - Français
Datadeduplisering - Norsk
Дедупликация - Русский
Дедуплікація - Українська
DeDuplication, Online Text Tools: http://www.text-filter.com/tools