Data deDuplication - deDup
deDuplication - English
Deduplikace - Čeština
Deduplikation - Deutsch
Deduplicación - Español
Déduplication - Français
Datadeduplisering - Norsk
Дедупликация - Русский
Дедуплікація - Українська
http://www.text-filter.com/tools/remove-duplicate-lines/
In computing, data deduplication is a specialized data compression
technique for eliminating duplicate copies of repeating data. Related and
somewhat synonymous terms are intelligent (data) compression and single-
instance (data) storage. This technique is used to improve storage
utilization and can also be applied to network data transfers to reduce
the number of bytes that must be sent. In the deduplication process,
unique chunks of data, or byte patterns, are identified and stored during
a process of analysis. As the analysis continues, other chunks are
compared to the stored copy and whenever a match occurs, the redundant
chunk is replaced with a small reference that points to the stored chunk.
Given that the same byte pattern may occur dozens, hundreds, or even
thousands of times (the match frequency is dependent on the chunk size),
the amount of data that must be stored or transferred can be greatly
reduced.[1]
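To make the chunk-and-reference process above concrete, here is a minimal sketch in Python. The fixed 4 KiB chunks, the SHA-256 identifiers and the in-memory dictionary are illustrative assumptions, not any particular product's design.

    import hashlib

    CHUNK_SIZE = 4096          # fixed-size chunking, kept simple for illustration
    store = {}                 # hash -> chunk bytes (the pool of unique chunks)

    def dedup_write(data: bytes) -> list[str]:
        """Split data into chunks, store each unique chunk once, return references."""
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)   # stored only if not already present
            refs.append(digest)               # a small reference replaces the chunk
        return refs

    def dedup_read(refs: list[str]) -> bytes:
        """Reconstitute the original byte stream from the stored references."""
        return b"".join(store[d] for d in refs)

    # Two writes of the same 1 MB payload consume the space of a single chunk.
    payload = b"A" * (1024 * 1024)
    refs1 = dedup_write(payload)
    refs2 = dedup_write(payload)
    assert dedup_read(refs1) == payload
    print(len(store), "unique chunk(s) for", len(refs1) + len(refs2), "references")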
This type of deduplication is different from that performed by standard
file-compression tools, such as LZ77 and LZ78. Whereas these tools
identify short repeated substrings inside individual files, the intent of
storage-based data deduplication is to inspect large volumes of data and
identify large sections – such as entire files or large sections of files
– that are identical, in order to store only one copy. This copy
may be additionally compressed by single-file compression techniques. For
example, a typical email system might contain 100 instances of the same 1
MB (megabyte) file attachment. Each time the email platform is backed up,
all 100 instances of the attachment are saved, requiring 100 MB storage
space. With data deduplication, only one instance of the attachment is
actually stored; the subsequent instances are referenced back to the
saved copy, for a deduplication ratio of roughly 100 to 1.
Storage-based data deduplication reduces the amount of storage needed for
a given set of files. It is most effective in applications where many
copies of very similar or even identical data are stored on a single
disk—a surprisingly common scenario. In the case of data backups, which
routinely are performed to protect against data loss, most data in a
given backup remain unchanged from the previous backup. Common backup
systems try to exploit this by omitting (or hard linking) files that
haven't changed or storing differences between files. Neither approach
captures all redundancies, however. Hard-linking does not help with large
files that have only changed in small ways, such as an email database;
differencing only finds redundancies between adjacent versions of a single file
(consider a section that was deleted and later added in again, or a logo
image included in many documents).
Network data deduplication is used to reduce the number of bytes that
must be transferred between endpoints, which can reduce the amount of
bandwidth required. See WAN optimization for more information.
Virtual servers and virtual desktops benefit from deduplication because
it allows nominally separate system files for each virtual machine to be
coalesced into a single storage space. At the same time, if a given
virtual machine customizes a file, deduplication will not change the
files on the other virtual machines—something that alternatives like hard
links or shared disks do not offer. Backing up or making duplicate copies
of virtual environments is similarly improved.
Deduplication overview
Deduplication may occur "in-line", as data is flowing, or "post-process"
after it has been written.
Post-process deduplication
With post-process deduplication, new data is first stored on the storage
device and then a process at a later time will analyze the data looking
for duplication. The benefit is that there is no need to wait for the
hash calculations and lookup to be completed before storing the data,
thereby ensuring that store performance is not degraded. Implementations
offering policy-based operation can give users the ability to defer
optimization on "active" files, or to process files based on type and
location. One potential drawback is that duplicate data may be
unnecessarily stored for a short time, which can be problematic if the
system is nearing full capacity.
In-line deduplication
Alternatively, deduplication hash calculations can be done synchronously
as data enters the target device. If the storage system identifies a
block which it has already stored, only a reference to the existing block
is stored, rather than the whole new block.
The advantage of in-line deduplication over post-process deduplication is
that it requires less storage, since duplicate data is never stored. On
the negative side, it is frequently argued[by whom?] that because hash
calculations and lookups take so long, data ingestion can be slower,
thereby reducing the backup throughput of the device. However, certain
vendors with in-line deduplication have demonstrated equipment with
similar performance to their post-process deduplication
counterparts.[according to whom?]
Incoming data is first stored in a "lining space" before it reaches the real storage blocks. On SSD-based systems this lining space is provided using NVRAM, which is not cost-efficient.[according to whom?]
Post-process and in-line deduplication methods are often heavily
debated.[2][3]
Data formats
The SNIA Dictionary identifies two methods:
content-agnostic data deduplication - a data deduplication method that does not require awareness of specific application data formats.
content-aware data deduplication - a data deduplication method that leverages knowledge of specific application data formats.
Source versus target deduplication
Another way to classify data deduplication methods is according to where
they occur. Deduplication occurring close to where data is created is often referred to[according to whom?] as "source deduplication". When it
occurs near where the data is stored, it is commonly called "target
deduplication".
Source deduplication ensures that data on the data source is
deduplicated. This generally takes place directly within a file
system.[4][5] The file system will periodically scan new files creating
hashes and compare them to the hashes of existing files. When files with the same hash are found, the duplicate copy is removed and the new file points to the old file. Unlike hard links, however, duplicated files are considered
to be separate entities and if one of the duplicated files is later
modified, then using a system called copy-on-write a copy of that file or
changed block is created. The deduplication process is transparent to the
users and backup applications. Backing up a deduplicated file system will often cause the duplication to reappear, resulting in backups that are larger than the source data.
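As an illustration of the periodic scan described above, the following sketch hashes every file under a directory and records which files could be collapsed into references to an earlier copy. It only builds the mapping; creating the references and handling copy-on-write belong to the file system, and the directory path in the usage line is hypothetical.

    import hashlib
    from pathlib import Path

    def scan_for_duplicates(root: str) -> dict[Path, Path]:
        """Return {duplicate_path: original_path} for files with identical content."""
        first_seen: dict[str, Path] = {}     # content hash -> first file with that hash
        duplicates: dict[Path, Path] = {}
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in first_seen:
                duplicates[path] = first_seen[digest]   # candidate for a reference
            else:
                first_seen[digest] = path
        return duplicates

    # Usage: list the files a deduplicating file system could turn into references.
    for dup, original in scan_for_duplicates("/var/backups").items():
        print(dup, "duplicates", original)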
Target deduplication is the process of removing duplicates when the data was not generated at that location. An example would be a server connected to a SAN/NAS: the SAN/NAS is the target for the server (target deduplication). The server is not aware of any deduplication, even though it is the point where the data is generated.
A second example is backup. Generally the target will be a backup store such as a data repository or a virtual tape library.
Deduplication methods
One of the most common forms of data deduplication implementations works
by comparing chunks of data to detect duplicates. For that to happen,
each chunk of data is assigned an identification, calculated by the
software, typically using cryptographic hash functions. In many
implementations, the assumption is made that if the identification is
identical, the data is identical, even though this cannot be true in all
cases due to the pigeonhole principle; other implementations do not
assume that two blocks of data with the same identifier are identical,
but actually verify that data with the same identification is
identical.[6] If the software either assumes that a given identification
already exists in the deduplication namespace or actually verifies the
identity of the two blocks of data, depending on the implementation, then
it will replace that duplicate chunk with a link.
Once the data has been deduplicated, upon read back of the file, wherever
a link is found, the system simply replaces that link with the referenced
data chunk. The deduplication process is intended to be transparent to
end users and applications.
Commercial deduplication implementations differ by their chunking methods
and architectures.
Chunking. In some systems, chunks are defined by physical layer
constraints (e.g. 4KB block size in WAFL). In some systems only complete
files are compared, which is called single-instance storage or SIS. The
most intelligent (but CPU-intensive) method of chunking is generally considered to be sliding-block chunking, in which a window is passed
along the file stream to seek out more naturally occurring internal file
boundaries.
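A minimal sketch of such a sliding-block (content-defined) chunker is shown below. The window size, rolling-hash parameters and boundary mask are arbitrary illustrative choices, not any standard.

    WINDOW = 48                 # bytes in the sliding window
    BASE = 257
    MOD = (1 << 61) - 1         # large modulus for the rolling hash
    MASK = (1 << 13) - 1        # a boundary roughly every 8 KiB on average
    MIN_CHUNK, MAX_CHUNK = 2048, 65536

    def chunk_boundaries(data: bytes) -> list[int]:
        """Return end offsets of content-defined chunks within data."""
        boundaries = []
        pow_w = pow(BASE, WINDOW - 1, MOD)   # factor for removing the oldest byte
        start, h = 0, 0
        for i, byte in enumerate(data):
            if i - start >= WINDOW:
                h = (h - data[i - WINDOW] * pow_w) % MOD   # drop byte leaving the window
            h = (h * BASE + byte) % MOD                    # add the incoming byte
            length = i - start + 1
            if (length >= MIN_CHUNK and (h & MASK) == MASK) or length >= MAX_CHUNK:
                boundaries.append(i + 1)   # a "naturally occurring" boundary
                start, h = i + 1, 0        # restart the window for the next chunk
        if start < len(data):
            boundaries.append(len(data))   # final partial chunk
        return boundaries

Because boundaries depend on content rather than on fixed offsets, inserting a few bytes near the start of a file shifts only the chunks around the edit, and the later chunks still hash to the same identifiers.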
Client backup deduplication. This is the process where the deduplication
hash calculations are initially created on the source (client) machines.
Files that have identical hashes to files already in the target device
are not sent; the target device just creates appropriate internal links
to reference the duplicated data. The benefit of this is that it avoids
data being unnecessarily sent across the network thereby reducing traffic
load.
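A sketch of that exchange is shown below, with an in-process dictionary standing in for the target device; the class and method names are invented for illustration.

    import hashlib

    class BackupTarget:
        """Stands in for the backup appliance (target device)."""
        def __init__(self):
            self.chunks: dict[str, bytes] = {}

        def missing(self, digests: list[str]) -> set[str]:
            return {d for d in digests if d not in self.chunks}

        def receive(self, chunks: dict[str, bytes]) -> None:
            self.chunks.update(chunks)

    def backup(client_chunks: list[bytes], target: BackupTarget) -> int:
        """Send only chunks the target has never seen; return bytes actually sent."""
        digests = [hashlib.sha256(c).hexdigest() for c in client_chunks]
        needed = target.missing(digests)                        # round 1: hashes only
        payload = {d: c for d, c in zip(digests, client_chunks) if d in needed}
        target.receive(payload)                                 # round 2: new data only
        return sum(len(c) for c in payload.values())

    target = BackupTarget()
    print(backup([b"chunk-a", b"chunk-b"], target))             # 14: both chunks travel
    print(backup([b"chunk-a", b"chunk-b", b"chunk-c"], target)) # 7: only chunk-c travels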
Primary storage and secondary storage. By definition, primary storage
systems are designed for optimal performance, rather than lowest possible
cost. The design criterion for these systems is to increase performance,
at the expense of other considerations. Moreover, primary storage systems
are much less tolerant of any operation that can negatively impact
performance. Also by definition, secondary storage systems contain
primarily duplicate, or secondary copies of data. These copies of data
are typically not used for actual production operations and as a result
are more tolerant of some performance degradation, in exchange for
increased efficiency.
To date, data deduplication has predominantly been used with secondary
storage systems. The reasons for this are two-fold. First, data
deduplication requires overhead to discover and remove the duplicate
data. In primary storage systems, this overhead may impact performance.
The second reason why deduplication is applied to secondary data, is that
secondary data tends to have more duplicate data. Backup applications in
particular commonly generate significant portions of duplicate data over
time.
Data deduplication has been deployed successfully with primary storage in
some cases where the system design does not require significant overhead,
or impact performance.
Drawbacks and concerns
Whenever data is transformed, concerns arise about potential loss of
data. By definition, data deduplication systems store data differently
from how it was written. As a result, users are concerned with the
integrity of their data. The various methods of deduplicating data all
employ slightly different techniques. However, the integrity of the data
will ultimately depend upon the design of the deduplicating system and the quality of its implementation of the algorithms. As the technology has
matured over the past decade, the integrity of most of the major products
has been well proven.[citation needed]
One method for deduplicating data relies on the use of cryptographic hash
functions to identify duplicate segments of data. If two different pieces
of information generate the same hash value, this is known as a
collision. The probability of a collision depends upon the hash function
used, and although the probabilities are small, they are always non-zero.
Thus, the concern arises that data corruption can occur if a hash
collision occurs, and additional means of verification are not used to
verify whether there is a difference in data, or not. Both in-line and
post-process architectures may offer bit-for-bit validation of original
data for guaranteed data integrity.[7] The hash functions used include
standards such as SHA-1, SHA-256 and others.
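For a sense of scale, the usual birthday-bound approximation can be computed directly; the figures below assume an ideal hash and are purely illustrative.

    def collision_probability(n_chunks: int, hash_bits: int) -> float:
        """Approximate P(any two of n chunks share a b-bit hash) ~= n^2 / 2^(b+1)."""
        return n_chunks * (n_chunks - 1) / 2 ** (hash_bits + 1)

    # One billion unique 8 KiB chunks (about 8 PB of data) with SHA-256:
    print(collision_probability(10**9, 256))   # ~4.3e-60: tiny, but non-zero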
The computational resource intensity of the process can be a drawback of
data deduplication. However, this is rarely an issue for stand-alone
devices or appliances, as the computation is completely offloaded from
other systems. This can be an issue when the deduplication is embedded
within devices providing other services. To improve performance, many
systems utilize both weak and strong hashes. Weak hashes are much faster
to calculate but there is a greater risk of a hash collision. Systems
that utilize weak hashes will subsequently calculate a strong hash and
will use it as the determining factor for whether it is actually the same
data or not. Note that the system overhead associated with calculating
and looking up hash values is primarily a function of the deduplication
workflow. The reconstitution of files does not require this processing
and any incremental performance penalty associated with re-assembly of
data chunks is unlikely to impact application performance.
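A sketch of that weak-then-strong check is given below, using zlib.adler32 as a stand-in for the weak hash and SHA-256 as the strong hash; the choice of functions is an assumption for illustration.

    import hashlib
    import zlib

    weak_index: dict[int, list[bytes]] = {}   # weak hash -> stored chunks sharing it

    def is_duplicate(chunk: bytes) -> bool:
        """Return True if an identical chunk is already stored; otherwise store it."""
        weak = zlib.adler32(chunk)
        candidates = weak_index.setdefault(weak, [])
        for stored in candidates:
            # A weak-hash match only suggests equality; confirm with the strong hash.
            if hashlib.sha256(stored).digest() == hashlib.sha256(chunk).digest():
                return True
        candidates.append(chunk)              # new unique chunk
        return False

    print(is_duplicate(b"same bytes"))   # False: first occurrence is stored
    print(is_duplicate(b"same bytes"))   # True: weak and strong hashes both match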
Another area of concern with deduplication is the related effect on
snapshots, backup, and archival, especially where deduplication is
applied against primary storage (for example inside a NAS filer).[further
explanation needed] Reading files out of a storage device causes full
reconstitution of the files (also known as rehydration), so any secondary
copy of the data set is likely to be larger than the primary copy. In
terms of snapshots, if a file is snapshotted prior to deduplication, the
post-deduplication snapshot will preserve the entire original file. This
means that although storage capacity for primary file copies will shrink,
capacity required for snapshots may expand dramatically.
Another concern is the effect of compression and encryption. Although
deduplication is a version of compression, it works in tension with
traditional compression. Deduplication achieves better efficiency against
smaller data chunks, whereas compression achieves better efficiency
against larger chunks. The goal of encryption is to eliminate any
discernible patterns in the data. Thus encrypted data cannot be
deduplicated, even though the underlying data may be redundant.
Deduplication ultimately reduces redundancy. If this is not expected and planned for, it may undermine the underlying reliability of the system.
(Compare this, for example, to the LOCKSS storage architecture that
achieves reliability through multiple copies of data.)
Scaling has also been a challenge for deduplication systems because
ideally, the scope of deduplication needs to be shared across storage
devices. If there are multiple disk backup devices in an infrastructure
with discrete deduplication, then space efficiency is adversely affected.
A deduplication shared across devices preserves space efficiency, but is
technically challenging from a reliability and performance
perspective.[citation needed]
Although not a shortcoming of data deduplication, there have been data
breaches[citation needed] when insufficient security and access
validation procedures are used with large repositories of deduplicated
data. In some systems, as typical with cloud storage,[citation needed] an
attacker can retrieve data owned by others by knowing or guessing the
hash value of the desired data.[8]
How does Dedup work?
The data deduplication process works by eliminating redundant data and
ensuring that only the first unique instance of any data is actually
retained. Subsequent iterations of the data are replaced with a pointer
to the original. Data deduplication can operate at the file, block or bit
level.
----------------------------------------------------------
Deduplicación de datos - Español
In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeated data. A term related to data deduplication is intelligent data compression. This technique is used to optimize data storage on disk and also to reduce the amount of information that must be sent from one device to another over communication networks.
One application of deduplication is to reduce the amount of data when creating backup copies of large systems.
----------------------------------------------------------
Déduplication - Français
In computing, deduplication (also called factoring or single-instance storage) is a data storage technique that consists of factoring out identical sequences of data in order to save the space used.
Each file is split into a multitude of chunks. Each of these chunks is assigned a unique identifier, and these identifiers are stored in an index. The goal of deduplication is to store a given chunk only once. A new occurrence of a chunk that is already present is therefore not saved again, but replaced by a pointer to the corresponding identifier.
Deduplication is used in particular in solutions of the VTL (Virtual Tape Library) type or any other kind of backup system.
Deduplication methods
Off-line deduplication
The data to be backed up is first copied to a buffer disk area, and in a second step a search for duplicate blocks is carried out. This method requires a large amount of storage space. It is the principle behind the Falconstor and Quantum DXi (firmware 1.x) solutions, for example.
In-line deduplication
The data to be backed up is analyzed "on the fly", and an index table of identical blocks is maintained (the NexentaStor solution from Nexenta Systems, Data Domain from EMC Corporation or IBM ProtecTIER).[1]
Source deduplication
Agents distributed across the servers to be backed up analyze the data at the source (notably the EMC Avamar solution).[1]
Principle
The index created during the backup is used to restore the data to the right place. Files or blocks recorded as duplicates in the index are duplicated again at restore time. Experience shows that in practice the deduplication ratio increases over time, because little data changes between two full backups. The reduction ratio obtained also depends strongly on the type of data processed.[2]
Drawbacks of deduplication
Risk of data loss, since the data is not stored in duplicate; the medium used must therefore be reliable. The reduction in backup size is an advantage over other types of backup, but at the expense of data safety. It is therefore recommended to create duplicates of the storage media.
Loss of the original format, which in some cases raises problems of compliance with legal constraints (for example Basel II). For this reason, some solutions offer to write sensitive data to cartridge in its original format, to guard against a possible failure of the VTL, for example.
Advantages of deduplication
The most important advantage is the reduction in the space occupied by backups: according to the Gartner research firm, this technology can reduce storage space requirements by a factor of 20 or even 30.[3]
An indirect advantage, a consequence of the previous one, is the reduction in the bandwidth needed for backup in the case of source deduplication.
----------------------------------------------------------
Deduplikation - Deutsch
Deduplication (from English "deduplication"), also data deduplication or de-duplication, is in information technology a process that identifies redundant data (duplicate detection) and eliminates it before it is written to a non-volatile storage medium. Like other techniques, the process compresses the amount of data that is sent from a sender to a receiver. It is nearly impossible to predict the efficiency of deduplication algorithms, since it always depends on the data structure and the rate of change. Deduplication can be a very efficient way of reducing data volumes in which pattern recognition is possible (unencrypted data).
For the time being, the primary field of application of deduplication is backup, where in practice it usually achieves stronger data compression than other methods. The technique is fundamentally suitable for any use case in which data is repeatedly copied.
How it works
Deduplication systems divide files into blocks of equal size (usually powers of two) and compute a checksum for each block. This is also what distinguishes them from single-instance storage (SIS), which is meant to eliminate identical files (see also content-addressable storage, CAS).
All checksums are then stored together with a reference to the corresponding file and the position within the file. When a new file is added, its content is likewise divided into blocks and the checksums are computed from them. It is then checked whether a checksum already exists. This comparison of checksums is much faster than comparing the file contents directly. If an identical checksum is found, this is an indication that an identical data block may have been found; however, it must still be verified that the contents really are identical, since it could also be a collision.
If an identical data block has been found, one of the blocks is removed and only a reference to the other data block is stored instead. This reference requires less storage space than the block itself.
There are two methods for selecting the blocks. With "reverse referencing" the first common block is stored, and all further identical blocks receive a reference to the first. "Forward referencing" always stores the most recently encountered common data block and references the previously encountered ones. This methodological dispute comes down to whether data should be stored faster or restored faster. Further approaches, such as "inband" and "outband", compete over whether the data stream is analyzed "on the fly", i.e. during operation, or only after it has been stored at the destination. In the first case only one data stream may exist; in the second, the data can be examined in parallel by means of several data streams.
Example
When backing up hard disks to tape media, the ratio of new or changed data to unchanged data between two full backups is usually only relatively small. With classic backup, however, two full backups still require at least twice the storage capacity on tape compared to the original data. Deduplication recognizes the identical data components. Unique segments are recorded in a list, and when such a piece of data reoccurs, its time and place in the data stream are noted, so that the original data can ultimately be restored. However, the backups are then no longer independent full backups, which means that the loss of one version leads to irretrievable data loss. Like incremental backups, deduplication thus trades data safety for lower storage requirements.
Chunking (Fingerprinting)
Fingerprinting also tries to determine how the incoming data stream can best be split into pieces so that as many identical data blocks as possible result. This process is called chunking (from English "chunk": piece, block).
Identification of blocks
The more precisely the changes to a file can be determined, the less has to be stored redundantly. However, this enlarges the index, i.e. the blueprint describing how and from which components the file is reassembled when it is accessed. The method used to identify common blocks is therefore also decisive.
----------------------------------------------------------
Дедупликация - Русский
[Russian-language section; the Cyrillic text was lost to a character-encoding error. The surviving fragments show that it covered the same ground as the English article: deduplication (from Latin deduplicatio), the division of data into chunks, and the distinction from conventional compression algorithms such as LZ77 and LZO.]
----------------------------------------------------------
Datadeduplisering - Norsk
Data deduplication is, in informatics, the name of a specialized data compression technique that eliminates duplicate copies of repeating data. Related and partly synonymous terms are intelligent (data) compression and single-instance storage. The technique is used to improve the utilization of storage media and can also be applied to data transfer in a network to reduce the number of bytes that must be sent. During deduplication, unique quantities of data, or byte patterns, are identified and stored in a process of analysis. As the analysis continues, other quantities of data are compared with the stored copy, and when a match occurs, the redundant quantity of data is replaced with a small reference that points to the stored data. Since the same quantity of data can occur dozens, hundreds or thousands of times, the amount of data stored or transferred can be reduced considerably.
----------------------------------------------------------
Дедуплікація - Українська
[Ukrainian-language section; the Cyrillic text was lost to a character-encoding error. The surviving fragments show that it discussed deduplication (from Latin deduplicatio), its implementation in ZFS[1][3] and IPFS[2], its interaction with snapshots and the RLE algorithm, the GNU fdupes utility for finding duplicate files, and block-identification hash functions such as SHA-1, SHA-256, MD5 and fletcher4.[4]]
----------------------------------------------------------
Deduplikace - Čeština
Simplified diagram of deduplication
Deduplication is a special data compression technique that prevents identical blocks of data from being stored more than once on a single storage system. The deduplication unit stores information (reference information) about the data structure and is therefore able, when deduplicated data is read back, to reconstruct the original, complete information. The purpose of deduplication is to save space on the storage system. Besides this variant, so-called block-level deduplication, there is also file-level deduplication, in which only one copy (instance) of a file or e-mail attachment is stored. Examples are the storage of e-mail messages in Microsoft Exchange[1] or Windows Single Instance Storage[2].
deDuplication - English
Deduplikace - Čeština
Deduplikation - Deutsch
Deduplicación - Español
Déduplication - Français
Datadeduplisering - Norsk
Дедупликация - Русский
Дедуплікація - Українська
DeDuplication, Online - Text Tools
http://www.text-filter.com/tools
Weitere ähnliche Inhalte

Was ist angesagt?

ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESneirew J
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Samsung Business USA
 
Distributed processing
Distributed processingDistributed processing
Distributed processingNeil Stein
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...Alexander Decker
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic WebIrina Hutanu
 
A hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplicationA hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplicationTmks Infotech
 
Hadoop file system
Hadoop file systemHadoop file system
Hadoop file systemJohn Veigas
 
A request skew aware heterogeneous distributed
A request skew aware heterogeneous distributedA request skew aware heterogeneous distributed
A request skew aware heterogeneous distributedJoão Gabriel Lima
 
Deduplication on Encrypted Big Data in HDFS
Deduplication on Encrypted Big Data in HDFSDeduplication on Encrypted Big Data in HDFS
Deduplication on Encrypted Big Data in HDFSIRJET Journal
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperDavid Walker
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiUnmesh Baile
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Survey of clustered_parallel_file_systems_004_lanl.ppt
Survey of clustered_parallel_file_systems_004_lanl.pptSurvey of clustered_parallel_file_systems_004_lanl.ppt
Survey of clustered_parallel_file_systems_004_lanl.pptRohn Wood
 
Nas and san
Nas and sanNas and san
Nas and sansongaco
 

Was ist angesagt? (18)

ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
 
Distributed processing
Distributed processingDistributed processing
Distributed processing
 
An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...An asynchronous replication model to improve data available into a heterogene...
An asynchronous replication model to improve data available into a heterogene...
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic Web
 
A hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplicationA hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplication
 
Hadoop file system
Hadoop file systemHadoop file system
Hadoop file system
 
A request skew aware heterogeneous distributed
A request skew aware heterogeneous distributedA request skew aware heterogeneous distributed
A request skew aware heterogeneous distributed
 
Cloud storage
Cloud storageCloud storage
Cloud storage
 
Deduplication on Encrypted Big Data in HDFS
Deduplication on Encrypted Big Data in HDFSDeduplication on Encrypted Big Data in HDFS
Deduplication on Encrypted Big Data in HDFS
 
Hdfs
HdfsHdfs
Hdfs
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - Paper
 
SiDe Enabled Reliable Replica Optimization
SiDe Enabled Reliable Replica OptimizationSiDe Enabled Reliable Replica Optimization
SiDe Enabled Reliable Replica Optimization
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbai
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Survey of clustered_parallel_file_systems_004_lanl.ppt
Survey of clustered_parallel_file_systems_004_lanl.pptSurvey of clustered_parallel_file_systems_004_lanl.ppt
Survey of clustered_parallel_file_systems_004_lanl.ppt
 
Nas and san
Nas and sanNas and san
Nas and san
 
Storage
StorageStorage
Storage
 

Ähnlich wie Deduplication - Remove Duplicate

Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...IRJET Journal
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESijccsa
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 
Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2Rishikesh Pathak
 
Scalable and adaptive data replica placement for geo distributed cloud storages
Scalable and adaptive data replica placement for geo distributed cloud storagesScalable and adaptive data replica placement for geo distributed cloud storages
Scalable and adaptive data replica placement for geo distributed cloud storagesVenkat Projects
 
Survey of distributed storage system
Survey of distributed storage systemSurvey of distributed storage system
Survey of distributed storage systemZhichao Liang
 
[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage ReductionPerforce
 
a hybrid cloud approach for secure authorized
a hybrid cloud approach for secure authorizeda hybrid cloud approach for secure authorized
a hybrid cloud approach for secure authorizedlogicsystemsprojects
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...dbpublications
 
Google File System
Google File SystemGoogle File System
Google File Systemvivatechijri
 
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPEVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPijdms
 
Doc A hybrid cloud approach for secure authorized deduplication
 Doc A hybrid cloud approach for secure authorized deduplication Doc A hybrid cloud approach for secure authorized deduplication
Doc A hybrid cloud approach for secure authorized deduplicationShakas Technologie
 
Compare Array vs Host vs Hypervisor vs Network-Based Replication
Compare Array vs Host vs Hypervisor vs Network-Based ReplicationCompare Array vs Host vs Hypervisor vs Network-Based Replication
Compare Array vs Host vs Hypervisor vs Network-Based ReplicationMaryJWilliams2
 
Distributed database
Distributed databaseDistributed database
Distributed databasesanjay joshi
 

Ähnlich wie Deduplication - Remove Duplicate (20)

Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
Methodology for Optimizing Storage on Cloud Using Authorized De-Duplication –...
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUESANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
ANALYSIS OF ATTACK TECHNIQUES ON CLOUD BASED DATA DEDUPLICATION TECHNIQUES
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Distributed file systems dfs
Distributed file systems   dfsDistributed file systems   dfs
Distributed file systems dfs
 
Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2Secure distributed deduplication systems with improved reliability 2
Secure distributed deduplication systems with improved reliability 2
 
Document 22.pdf
Document 22.pdfDocument 22.pdf
Document 22.pdf
 
Scalable and adaptive data replica placement for geo distributed cloud storages
Scalable and adaptive data replica placement for geo distributed cloud storagesScalable and adaptive data replica placement for geo distributed cloud storages
Scalable and adaptive data replica placement for geo distributed cloud storages
 
Survey of distributed storage system
Survey of distributed storage systemSurvey of distributed storage system
Survey of distributed storage system
 
[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction
 
Operating system
Operating systemOperating system
Operating system
 
a hybrid cloud approach for secure authorized
a hybrid cloud approach for secure authorizeda hybrid cloud approach for secure authorized
a hybrid cloud approach for secure authorized
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
 
Google File System
Google File SystemGoogle File System
Google File System
 
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUPEVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
EVALUATE DATABASE COMPRESSION PERFORMANCE AND PARALLEL BACKUP
 
Doc A hybrid cloud approach for secure authorized deduplication
 Doc A hybrid cloud approach for secure authorized deduplication Doc A hybrid cloud approach for secure authorized deduplication
Doc A hybrid cloud approach for secure authorized deduplication
 
Compare Array vs Host vs Hypervisor vs Network-Based Replication
Compare Array vs Host vs Hypervisor vs Network-Based ReplicationCompare Array vs Host vs Hypervisor vs Network-Based Replication
Compare Array vs Host vs Hypervisor vs Network-Based Replication
 
Distributed database
Distributed databaseDistributed database
Distributed database
 
Data replication
Data replicationData replication
Data replication
 

Kürzlich hochgeladen

This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 

Kürzlich hochgeladen (20)

This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 

Deduplication - Remove Duplicate

  • 1.
  • 2. Data deDuplication - deDup deDuplication - English Deduplikace - Čeština Deduplikation - Deutsch Deduplicación Español Déduplication - Français Datadeduplisering - Norsk Д - Д - ttp://www.text-filter.com/tools/remove-duplicate-lines/ In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are intelligent (data) compression and single- instance (data) storage. This technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.[1] This type of deduplication is different from that performed by standard file-compression tools, such as LZ77 and LZ78. Whereas these tools identify short repeated substrings inside individual files, the intent of storage-based data deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, in order to store only one copy of it. This copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy for deduplication ratio of roughly 100 to 1. Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on a single disk—a surprisingly common scenario. In the case of data backups, which routinely are performed to protect against data loss, most data in a given backup remain unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard linking) files that haven't changed or storing differences between files. Neither approach captures all redundancies, however. Hard-linking does not help with large files that have only changed in small ways, such as an email database; differences only find redundancies in adjacent versions of a single file (consider a section that was deleted and later added in again, or a logo image included in many documents).
  • 3. Network data deduplication is used to reduce the number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information. Virtual servers and virtual desktops benefit from deduplication because it allows nominally separate system files for each virtual machine to be coalesced into a single storage space. At the same time, if a given virtual machine customizes a file, deduplication will not change the files on the other virtual machines—something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved. Deduplication overview Deduplication may occur "in-line", as data is flowing, or "post-process" after it has been written. Post-process deduplication With post-process deduplication, new data is first stored on the storage device and then a process at a later time will analyze the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, thereby ensuring that store performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on "active" files, or to process files based on type and location. One potential drawback is that duplicate data may be unnecessarily stored for a short time, which can be problematic if the system is nearing full capacity. In-line deduplication Alternatively, deduplication hash calculations can be done synchronized as data enters the target device. If the storage system identifies a block which it has already stored, only a reference to the existing block is stored, rather than the whole new block. The advantage of in-line deduplication over post-process deduplication is that it requires less storage, since duplicate data is never stored. On the negative side, it is frequently argued[by whom?] that because hash calculations and lookups take so long, data ingestion can be slower, thereby reducing the backup throughput of the device. However, certain vendors with in-line deduplication have demonstrated equipment with similar performance to their post-process deduplication counterparts.[according to whom?] Data coming in is stored into "lining space" before it hits real storage blocks. On SSD disks lining space is provided using NVRAM which is not cost-efficient.[according to whom?] Post-process and in-line deduplication methods are often heavily debated.[2][3] Data formats SNIA Dictionary identifies two methods: content-agnostic data deduplication - a data deduplication method that does not require awareness of specific application data formats. content-aware data deduplication - a data deduplication method that leverages knowledge of specific application data formats. Source versus target deduplication Another way to classify data deduplication methods is according to where they occur. Deduplication occurring close to where data is created, is
  • 4. often referred to[according to whom?] as "source deduplication". When it occurs near where the data is stored, it is commonly called "target deduplication". Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file system.[4][5] The file system will periodically scan new files creating hashes and compare them to hashes of existing files. When files with same hashes are found then the file copy is removed and the new file points to the old file. Unlike hard links however, duplicated files are considered to be separate entities and if one of the duplicated files is later modified, then using a system called copy-on-write a copy of that file or changed block is created. The deduplication process is transparent to the users and backup applications. Backing up a deduplicated file system will often cause duplication to occur resulting in the backups being bigger than the source data. Target deduplication is the process of removing duplicates when the data was not generated at that location. Example of this would be a server connected to a SAN/NAS, The SAN/NAS would be a target for the server (Target deduplication). The server is not aware of any deduplication, the server is also the point of data generation. A second example would be backup. Generally this will be a backup store such as a data repository or a virtual tape library. Deduplication methods One of the most common forms of data deduplication implementations works by comparing chunks of data to detect duplicates. For that to happen, each chunk of data is assigned an identification, calculated by the software, typically using cryptographic hash functions. In many implementations, the assumption is made that if the identification is identical, the data is identical, even though this cannot be true in all cases due to the pigeonhole principle; other implementations do not assume that two blocks of data with the same identifier are identical, but actually verify that data with the same identification is identical.[6] If the software either assumes that a given identification already exists in the deduplication namespace or actually verifies the identity of the two blocks of data, depending on the implementation, then it will replace that duplicate chunk with a link. Once the data has been deduplicated, upon read back of the file, wherever a link is found, the system simply replaces that link with the referenced data chunk. The deduplication process is intended to be transparent to end users and applications. Commercial deduplication implementations differ by their chunking methods and architectures. Chunking. In some systems, chunks are defined by physical layer constraints (e.g. 4KB block size in WAFL). In some systems only complete files are compared, which is called single-instance storage or SIS. The most intelligent (but CPU intensive) method to chunking is generally considered to be sliding-block. In sliding block, a window is passed along the file stream to seek out more naturally occurring internal file boundaries. Client backup deduplication. This is the process where the deduplication hash calculations are initially created on the source (client) machines. Files that have identical hashes to files already in the target device are not sent, the target device just creates appropriate internal links to reference the duplicated data. The benefit of this is that it avoids
Client backup deduplication. This is the process where the deduplication hash calculations are initially created on the source (client) machines. Files that have hashes identical to files already in the target device are not sent; the target device just creates appropriate internal links to reference the duplicated data. The benefit of this is that it avoids data being unnecessarily sent across the network, thereby reducing the traffic load.
Primary storage and secondary storage. By definition, primary storage systems are designed for optimal performance rather than lowest possible cost. The design criterion for these systems is to increase performance, at the expense of other considerations. Moreover, primary storage systems are much less tolerant of any operation that can negatively impact performance. Also by definition, secondary storage systems contain primarily duplicate, or secondary, copies of data. These copies of data are typically not used for actual production operations and as a result are more tolerant of some performance degradation in exchange for increased efficiency.
To date, data deduplication has predominantly been used with secondary storage systems. The reasons for this are twofold. First, data deduplication requires overhead to discover and remove the duplicate data; in primary storage systems, this overhead may impact performance. The second reason why deduplication is applied to secondary data is that secondary data tends to contain more duplicates. Backup applications in particular commonly generate significant portions of duplicate data over time. Data deduplication has been deployed successfully with primary storage in some cases where the system design does not require significant overhead or impact performance.

Drawbacks and concerns
Whenever data is transformed, concerns arise about potential loss of data. By definition, data deduplication systems store data differently from how it was written. As a result, users are concerned about the integrity of their data. The various methods of deduplicating data all employ slightly different techniques. However, the integrity of the data will ultimately depend upon the design of the deduplicating system and the quality of the implementation of its algorithms. As the technology has matured over the past decade, the integrity of most of the major products has been well proven.[citation needed]
One method for deduplicating data relies on the use of cryptographic hash functions to identify duplicate segments of data. If two different pieces of information generate the same hash value, this is known as a collision. The probability of a collision depends upon the hash function used, and although the probabilities are small, they are always non-zero. Thus, the concern arises that data corruption can occur if a hash collision occurs and additional means of verification are not used to check whether the data actually differs. Both in-line and post-process architectures may offer bit-for-bit validation of original data for guaranteed data integrity.[7] The hash functions used include standards such as SHA-1, SHA-256 and others.
The computational resource intensity of the process can be a drawback of data deduplication. However, this is rarely an issue for stand-alone devices or appliances, as the computation is completely offloaded from other systems. It can be an issue when the deduplication is embedded within devices providing other services. To improve performance, many systems utilize both weak and strong hashes. Weak hashes are much faster to calculate, but there is a greater risk of a hash collision. Systems that utilize weak hashes will subsequently calculate a strong hash and use it as the determining factor of whether the data is actually the same or not.
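That two-stage strategy can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular product: CRC-32 (from Python's zlib) stands in for the weak hash, SHA-256 for the strong one, and an optional byte-for-byte comparison guards even against a strong-hash collision.

import hashlib
import zlib

class VerifyingChunkStore:
    """A cheap weak hash acts as a prefilter; a strong hash (and, optionally,
    a byte-for-byte comparison) is the deciding factor for duplicates."""

    def __init__(self, verify_bytes=True):
        self.verify_bytes = verify_bytes
        self.by_weak = {}   # crc32 -> list of chunk ids (weak collisions are expected)
        self.strong = []    # chunk id -> sha256 digest of the stored chunk
        self.chunks = []    # chunk id -> stored bytes

    def store(self, chunk):
        weak = zlib.crc32(chunk)
        candidates = self.by_weak.get(weak, [])
        if candidates:                               # only now pay for the strong hash
            digest = hashlib.sha256(chunk).digest()
            for cid in candidates:
                if self.strong[cid] == digest and (
                        not self.verify_bytes or self.chunks[cid] == chunk):
                    return cid                       # confirmed duplicate: reuse it
        cid = len(self.chunks)                       # genuinely new data: keep it
        self.chunks.append(chunk)
        self.strong.append(hashlib.sha256(chunk).digest())
        self.by_weak.setdefault(weak, []).append(cid)
        return cid

In this sketch, verify_bytes=True corresponds to the bit-for-bit validation mentioned above; turning it off trades that guarantee for speed.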
Note that the system overhead associated with calculating and looking up hash values is primarily a function of the deduplication workflow. The reconstitution of files does not require this processing
and any incremental performance penalty associated with re-assembly of data chunks is unlikely to impact application performance.
Another area of concern with deduplication is the related effect on snapshots, backup, and archival, especially where deduplication is applied against primary storage (for example inside a NAS filer).[further explanation needed] Reading files out of a storage device causes full reconstitution of the files (also known as rehydration), so any secondary copy of the data set is likely to be larger than the primary copy. In terms of snapshots, if a file is snapshotted prior to deduplication, the post-deduplication snapshot will preserve the entire original file. This means that although storage capacity for primary file copies will shrink, the capacity required for snapshots may expand dramatically.
Another concern is the effect of compression and encryption. Although deduplication is a version of compression, it works in tension with traditional compression: deduplication achieves better efficiency against smaller data chunks, whereas compression achieves better efficiency against larger chunks. The goal of encryption is to eliminate any discernible patterns in the data, so encrypted data cannot be deduplicated, even though the underlying data may be redundant.
Deduplication ultimately reduces redundancy. If this was not expected and planned for, it may ruin the underlying reliability of the system. (Compare this, for example, to the LOCKSS storage architecture, which achieves reliability through multiple copies of data.)
Scaling has also been a challenge for deduplication systems, because ideally the scope of deduplication needs to be shared across storage devices. If there are multiple disk backup devices in an infrastructure with discrete deduplication, then space efficiency is adversely affected. Deduplication shared across devices preserves space efficiency, but is technically challenging from a reliability and performance perspective.[citation needed]
Although not a shortcoming of data deduplication, there have been data breaches[citation needed] when insufficient security and access validation procedures are used with large repositories of deduplicated data. In some systems, as is typical with cloud storage,[citation needed] an attacker can retrieve data owned by others by knowing or guessing the hash value of the desired data.[8]

How does Dedup work?
The data deduplication process works by eliminating redundant data and ensuring that only the first unique instance of any data is actually retained. Subsequent iterations of the data are replaced with a pointer to the original. Data deduplication can operate at the file, block or bit level.
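A file-level (single-instance) variant of this process can be sketched in a few lines. The class below is a toy model, not a real filesystem; the file names and the SHA-256 pointer scheme are assumptions for the example. Storing a file retains only the first copy of each distinct content and records a pointer per logical name; reading a name follows the pointer back to the retained copy, which corresponds to the rehydration step discussed above. A block-level implementation applies the same idea to smaller units.

import hashlib

class SingleInstanceStore:
    """File-level deduplication: one physical copy per distinct content,
    and a pointer (the content hash) for every logical file name."""

    def __init__(self):
        self.blobs = {}     # content hash -> the single retained copy
        self.pointers = {}  # file name -> content hash

    def put(self, name, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # only the first instance is retained
        self.pointers[name] = digest            # later instances become pointers

    def get(self, name):
        # "rehydration": follow the pointer back to the retained copy
        return self.blobs[self.pointers[name]]

store = SingleInstanceStore()
store.put("reports/2011/q3.doc", b"...same bytes...")
store.put("archive/q3-copy.doc", b"...same bytes...")
assert len(store.blobs) == 1
assert store.get("archive/q3-copy.doc") == b"...same bytes..."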
----------------------------------------------------------
Deduplicación de datos - Español
In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeated data. A related term is intelligent data compression. This technique is used to optimize data storage on disk and also to reduce the amount of information that must be sent from one device to another across communication networks. One application of deduplication is to reduce the amount of data involved in creating backup copies of large systems.
----------------------------------------------------------
Déduplication - Français
In computing, deduplication (also called factorization or single-instance storage) is a data storage technique that consists of factoring out identical sequences of data in order to save the space used.
Each file is split into a multitude of chunks. Each of these chunks is assigned a unique identifier, and these identifiers are stored in an index. The objective of deduplication is to store a given chunk only once. A new occurrence of a chunk that is already present is therefore not saved again, but replaced by a pointer to the corresponding identifier.
Deduplication is used in particular in VTL (Virtual Tape Library) solutions and in any other type of backup system.

Deduplication methods
Off-line deduplication: The data to be backed up is copied to a buffer disk area and, in a second step, a search for duplicate blocks is carried out. This method requires a large amount of storage space. It is the principle of the Falconstor solutions or of the Quantum DXi with 1.x firmware, for example.
In-line deduplication: The data to be backed up is analyzed "on the fly", and an index table of identical blocks is maintained (for example the NexentaStor solution from Nexenta Systems, Data Domain from EMC Corporation, or IBM ProtecTIER).[1]
Source deduplication: Agents distributed across the servers to be backed up analyze the data at the source (notably the EMC Avamar solution).[1]

Principle
The index created during the backup is used to restore the data to the right place. Files or blocks that are duplicated in the index are duplicated again at restore time. Experience shows that in practice the deduplication ratio increases over time, because little data actually changes between two full backups. The reduction ratio obtained also depends strongly on the type of data processed.[2]

Drawbacks of deduplication
There is a risk of data loss, because the data is no longer held in duplicate, so the medium used must be reliable. The reduction in backup size is an advantage over other types of backup, but it comes at the expense of data safety; it is therefore recommended to create duplicates of the storage media. The original format is also lost, which in some cases raises problems of compliance with legal requirements (for example Basel II). Some solutions therefore offer to write sensitive data to cartridge in its original format, to guard against a possible failure of the VTL, for example.
Advantage of deduplication
The most important advantage is the reduction in the space occupied by backups: according to the analyst firm Gartner, this technology can divide storage space requirements by 20 or even 30.[3] An indirect advantage, a consequence of the previous one, is the reduction in the bandwidth needed for backup in the case of source deduplication.
----------------------------------------------------------
Deduplikation - Deutsch
Deduplication (German Deduplikation, from the English deduplication), also called data deduplication or Deduplizierung, is a process in information technology that identifies redundant data (duplicate detection) and eliminates it before it is written to non-volatile storage. Like other techniques, the process compresses the amount of data that is sent from a sender to a receiver. It is almost impossible to predict the efficiency of deduplication algorithms, since it always depends on the data structure and the rate of change. Deduplication can be a very efficient way of reducing data volumes wherever pattern recognition is possible (unencrypted data). The primary field of application for deduplication is, for the time being, backup, where in practice it usually achieves stronger data reduction than other methods. The technique is fundamentally suitable for any use case in which data is copied repeatedly.

How it works
Deduplication systems divide files into blocks of equal size (usually powers of two) and compute a checksum for each block. This is also what distinguishes the approach from single-instance storage (SIS), which is meant to eliminate identical files (see also content-addressed storage, CAS). All checksums are then stored together with a reference to the corresponding file and the position within the file. When a new file is added, its content is likewise divided into blocks and checksums are computed from them. A comparison is then made to see whether a checksum already exists. Comparing checksums is much faster than comparing the file contents directly. If an identical checksum is found, this is an indication that an identical data block may have been found; it must still be checked whether the contents are actually identical, however, since it may also be a collision. If an identical data block has been found, one of the blocks is removed and only a reference to the other data block is stored instead. This reference requires less storage space than the block itself.
There are two methods for selecting which block is retained. With "reverse referencing", the first common block encountered is stored, and all further identical blocks receive a reference to this first one. "Forward referencing" always stores the most recently encountered common data block and re-references the elements that occurred earlier. This methodological debate comes down to whether data should be stored faster or restored faster. Further approaches, such as "in-band" and "out-of-band", compete over whether the data stream is analyzed "on the fly", i.e. during operation, or only after it has been stored at the destination. In the first case only a single data stream may exist; in the second, the data can be examined in parallel using several data streams.
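The reverse- versus forward-referencing choice can be made concrete with a small sketch. This is an illustrative model only, with made-up names: every logical reference resolves through a table keyed by the block checksum, so the policy merely decides which physical copy that table points at. Reverse referencing keeps the first copy ever written; forward referencing re-points the table at the newest copy, which on real media tends to favour restoring the most recent backup at the cost of extra work at write time.

import hashlib

class BlockStore:
    """Fixed-size-block deduplication with a selectable referencing policy."""

    def __init__(self, policy="reverse", block_size=4096):
        assert policy in ("reverse", "forward")
        self.policy = policy
        self.block_size = block_size
        self.location = {}   # checksum -> index of the physical copy currently retained
        self.physical = []   # physical block storage
        self.streams = {}    # stream name -> list of checksums (the index, or "recipe")

    def save(self, name, data):
        recipe = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.location:
                self.location[digest] = len(self.physical)   # first occurrence is kept
                self.physical.append(block)
            elif self.policy == "forward":
                # forward referencing: the newest occurrence becomes the retained copy;
                # earlier references now resolve to it through the table (the superseded
                # copy would be reclaimed by garbage collection in a real system)
                self.location[digest] = len(self.physical)
                self.physical.append(block)
            recipe.append(digest)
        self.streams[name] = recipe

    def restore(self, name):
        return b"".join(self.physical[self.location[d]] for d in self.streams[name])

In this in-memory toy both policies restore identical bytes; the difference shows up only in which physical copy is retained and in how much work the write path performs.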
Example
When backing up hard disks to tape media, the ratio of new or changed data to unchanged data between two full backups is usually relatively small. With classic backup, two full backups nevertheless require at least twice the storage capacity on tape compared to the original data. Deduplication recognizes the identical data components. Unique segments are recorded in a list, and when such a piece of data occurs again, its time and position in the data stream are noted, so that the original data can ultimately be reconstructed. However, the backups are then no longer independent full backups, which means that the loss of one version leads to irrecoverable data loss. Deduplication thus, like incremental backups, trades some data safety for reduced storage requirements.

Chunking (fingerprinting)
Fingerprinting also tries to determine how the incoming data stream can best be broken into pieces so that as many identical data blocks as possible are produced. This process is called chunking (from the English chunk, 'piece', 'block').

Identification of blocks
The more precisely the changes to a file can be determined, the less has to be stored redundantly. However, this increases the size of the index, i.e. the blueprint describing how and from which components the file is reassembled when it is read. The method used to identify common blocks is therefore also decisive.
----------------------------------------------------------
Дедупликация - Русский
----------------------------------------------------------
Datadeduplisering - Norsk
Within computing, data deduplication is the name of a specialized data compression technique that eliminates duplicate copies of repeating data. Related and partly synonymous terms are intelligent (data) compression and single-instance storage. The technique is used to improve the utilization of storage media and can also be applied to data transfers in a network in order to reduce the number of bytes that must be sent. During deduplication, unique quantities of data, or byte patterns, are identified and stored in a process of analysis. As the analysis continues, other quantities of data are compared with the stored copy, and when a match occurs the redundant data is replaced with a small reference that points to the stored data. Since the same data can occur dozens, hundreds or thousands of times, the amount of data that has to be stored or transferred can be reduced considerably.
----------------------------------------------------------
Дедуплікація - Українська
----------------------------------------------------------
Deduplikace - Čeština
Simplified diagram of deduplication
Deduplication is a special data compression technique that prevents identical blocks of data from being stored more than once on a single storage system. The deduplication unit stores information (reference information) about the data structure and is therefore able, when the deduplicated data is read back, to restore the original, complete information. The purpose of deduplication is to save space on the data storage.
Besides this variant, so-called block-level deduplication, there is also file-level deduplication, in which only one copy (instance) of a file or e-mail attachment is stored. Examples include the storage of e-mail messages in Microsoft Exchange[1] or Windows Single Instance Storage[2].
Deduplication - English
Deduplikace - Čeština
Deduplikation - Deutsch
Deduplicación - Español
Déduplication - Français
Datadeduplisering - Norsk
Дедупликация - Русский
Дедуплікація - Українська
DeDuplication, Online Text Tools: http://www.text-filter.com/tools