This document provides a survey of peer-to-peer content distribution technologies. It begins with defining key concepts of peer-to-peer computing and classifying peer-to-peer systems. The focus is on content distribution systems, which allow personal computers to function as a distributed storage medium for digital content. The document proposes a framework for analyzing nonfunctional characteristics and architectural designs of current peer-to-peer content distribution systems.
Injustice - Developers Among Us (SciFiDevCon 2024)
A survey of peer-to-peer content distribution technologies
1. A Survey of Peer-to-Peer Content Distribution Technologies
STEPHANOS ANDROUTSELLIS-THEOTOKIS AND DIOMIDIS SPINELLIS
Athens University of Economics and Business
Distributed computer architectures labeled “peer-to-peer” are designed for the sharing
of computer resources (content, storage, CPU cycles) by direct exchange, rather than
requiring the intermediation or support of a centralized server or authority.
Peer-to-peer architectures are characterized by their ability to adapt to failures and
accommodate transient populations of nodes while maintaining acceptable connectivity
and performance.
Content distribution is an important peer-to-peer application on the Internet that
has received considerable research attention. Content distribution applications
typically allow personal computers to function in a coordinated manner as a distributed
storage medium by contributing, searching, and obtaining digital content.
In this survey, we propose a framework for analyzing peer-to-peer content
distribution technologies. Our approach focuses on nonfunctional characteristics such
as security, scalability, performance, fairness, and resource management potential, and
examines the way in which these characteristics are reflected in—and affected by—the
architectural design decisions adopted by current peer-to-peer systems.
We study current peer-to-peer systems and infrastructure technologies in terms of
their distributed object location and routing mechanisms, their approach to content
replication, caching and migration, their support for encryption, access control,
authentication and identity, anonymity, deniability, accountability and reputation, and
their use of resource trading and management schemes.
Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]:
Network Architecture and Design—Network topology; C.2.2 [Computer-
Communication Networks]: Network Protocols—Routing protocols; C.2.4
[Computer-Communication Networks]: Distributed Systems—Distributed
databases; H.2.4 [Database Management]: Systems—Distributed databases; H.3.4
[Information Storage and Retrieval]: Systems and Software—Distributed systems
General Terms: Algorithms, Design, Performance, Reliability, Security
Additional Key Words and Phrases: Content distribution, DOLR, DHT, grid computing,
p2p, peer-to-peer
1. INTRODUCTION systems such as [Gnutella 2003], Seti@
Home [SetiAtHome 2003], OceanStore
A new wave of network architectures [Kubiatowicz et al. 2000], and many
labeled peer-to-peer is the basis of others. Such architectures are gener-
operation of distributed computing ally characterized by the direct sharing
Authors’ address: Athens University of Economics and Business, 76 Patission St., GR-104 34, Athens, Greece;
email: stheotok@aueb.gr.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or direct commercial advantage and
that copies show this notice on the first page or initial screen of a display along with the full citation.
Copyrights for components of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any
component of this work in other works requires prior specific permission and/or a fee. Permissions may be
requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212)
869-0481, or permissions@acm.org.
c 2004 ACM 0360-0300/04/1200-0335 $5.00
ACM Computing Surveys, Vol. 36, No. 4, December 2004, pp. 335–371.
2. 336 S. Androutsellis-Theotokis and D. Spinellis
of computer resources (CPU cycles, stor- We then present the main attributes of
age, content) rather than requiring the in- peer-to-peer content distribution systems,
termediation of a centralized server. and the aspects of their architectural de-
The motivation behind basing appli- sign which affect these attributes are an-
cations on peer-to-peer architectures alyzed with reference to specific existing
derives to a large extent from their ability peer-to-peer content distribution systems
to function, scale, and self-organize in and technologies.
the presence of a highly transient popu- Throughout this report the terms
lation of nodes, network, and computer “node”, “peer” and “user” are used inter-
failures, without the need of a central changeably, according to the context, to re-
server and the overhead of its adminis- fer to the entities that are connected in a
tration. Such architectures typically have peer-to-peer network.
as inherent characteristics scalability,
resistance to censorship and centralized
1.1. Defining Peer-to-Peer Computing
control, and increased access to resources.
Administration, maintenance, respon- A quick look at the literature reveals a con-
sibility for the operation, and even the siderable number of different definitions
notion of “ownership” of peer-to-peer sys- of “peer-to-peer”, mainly distinguished
tems are also distributed among the users, by the “broadness” they attach to the
instead of being handled by a single com- term.
pany, institution or person (see also Agre The strictest definitions of “pure” peer-
[2003] for an interesting discussion of in- to-peer refer to totally distributed sys-
stitutional change through decentralized tems, in which all nodes are completely
architectures). Finally, peer-to-peer archi- equivalent in terms of functionality and
tectures have the potential to accelerate tasks they perform. These definitions fail
communication processes and reduce to encompass, for example, systems that
collaboration costs through the ad hoc ad- employ the notion of “supernodes” (nodes
ministration of working groups [Schoder that function as dynamically assigned
and Fischbach 2003]. localized mini-servers) such as Kazaa
This report surveys peer-to-peer con- [2003], which are, however, widely ac-
tent distribution technologies, aiming to cepted as peer-to-peer, or systems that
provide a comprehensive account of ap- rely on some centralized server infras-
plications, features, and implementation tructure for a subset of noncore tasks
techniques. As this is a new and (thank- (e.g. bootstrapping, maintaining reputa-
fully) rapidly evolving field, and advances tion ratings, etc).
and new capabilities are constantly being According to a broader and widely ac-
introduced, this article will be present- cepted definition in Shirky [2000], “peer-
ing what essentially constitutes a “snap- to-peer is a class of applications that take
shot” of the state of the art around the advantage of resources—storage, cycles,
time of its writing—as is unavoidably the content, human presence—available at
case for any survey of a thriving research the edges of the internet”. This defini-
field. We do, however, believe that the tion, however, encompasses systems that
core information and principles presented completely rely upon centralized servers
will remain relevant and useful for the for their operation (such as seti@home,
reader. various instant messaging systems, or
In the next section, we define the basic even the notorious Napster), as well as
concepts of peer-to-peer computing. We various applications from the field of Grid
classify peer-to-peer systems into three computing.
categories (communication and collab- Overall, it is fair to say that there is
oration, distributed computation, and no general agreement about what “is” and
content distribution). Content distribu- what “is not” peer-to-peer.
tion systems are further discussed and We feel that this lack of agreement on
categorized. a definition—or rather the acceptance of
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
3. A Survey of Content Distribution Technologies 337
various different definitions—is, to a large nections fail or recover, in order to main-
extent, due to the fact that systems or tain its connectivity and performance.
applications are labeled “peer-to-peer” not
because of their internal operation or ar- We therefore propose the following defi-
chitecture, but rather as a result of how nition:
they are perceived “externally”, that is, Peer-to-peer systems are distributed systems
whether they give the impression of pro- consisting of interconnected nodes able to self-
viding direct interaction between comput- organize into network topologies with the purpose
ers. As a result, different definitions of of sharing resources such as content, CPU cycles,
“peer-to-peer” are applied to accommodate storage and bandwidth, capable of adapting to
the various different cases of such systems failures and accommodating transient popula-
or applications. tions of nodes while maintaining acceptable con-
From our perspective, we believe that nectivity and performance, without requiring the
the two defining characteristics of peer-to- intermediation or support of a global centralized
server or authority.
peer architectures are the following:
—The sharing of computer resources by This definition is meant to encompass
direct exchange, rather than requir- “degrees of centralization” ranging from
ing the intermediation of a centralized the pure, completely decentralized sys-
server. Centralized servers can some- tems such as Gnutella, to “partially cen-
times be used for specific tasks (sys- tralized” systems1 such as Kazaa. How-
tem bootstrapping, adding new nodes ever, for the purposes of this survey, we
to the network, obtain global keys for shall not restrict our presentation and dis-
data encryption), however, systems that cussion of architectures and systems to
rely on one or more global centralized our own proposed definition, and we will
servers for their basic operation (e.g. for take into account systems that are consid-
maintaining a global index and search- ered peer-to-peer by other definitions as
ing through it—Napster, Publius) are well, including systems that employ a cen-
clearly stretching the definition of peer- tralized server (such as Napster, instant
to-peer. messaging applications, and others).
As the nodes of a peer-to-peer network The focus of our study is content distri-
cannot rely on a central server coordi- bution, a significant area of peer-to-peer
nating the exchange of content and the systems that has received considerable re-
operation of the entire network, they are search attention.
required to actively participate by inde-
pendently and unilaterally performing 1.2. Peer-to-Peer and Grid Computing
tasks such as searching for other nodes, Peer-to-peer and Grid computing are two
locating or caching content, routing in- approaches to distributed computing, both
formation and messages, connecting to concerned with the organization of re-
or disconnecting from other neighboring source sharing in large-scale computa-
nodes, encrypting, introducing, retriev- tional societies.
ing, decrypting and verifying content, as Grids are distributed systems that en-
well as others. able the large-scale coordinated use and
—Their ability to treat instability and sharing of geographically distributed re-
variable connectivity as the norm, au- sources, based on persistent, standards-
tomatically adapting to failures in both based service infrastructures, often with
network connections and computers, as a high-performance orientation [Foster
well as to a transient population of et al. 2001].
nodes.
This fault-tolerant, self-organizing ca- 1 In our definition, we refer to “global” servers, to
pacity suggests the need for an adap- make a distinction from the dynamically assigned
tive network topology that will change “supernodes” of partially centralized systems (see
as nodes enter or leave and network con- also Section 3.3.3)
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
4. 338 S. Androutsellis-Theotokis and D. Spinellis
As Grid systems increase in scale, they Chat/Irc, Instant Messaging (Aol, Icq,
begin to require solutions to issues of self- Yahoo, Msn), and Jabber [Jabber 2003].
configuration, fault tolerance, and scala- Distributed Computation. This category
bility, for which peer-to-peer research has includes systems whose aim is to take ad-
much to offer. vantage of the available peer computer
Peer-to-peer systems, on the other hand, processing power (CPU cycles). This is
focus on dealing with instability, tran- achieved by breaking down a computer-
sient populations, fault tolerance, and self- intensive task into small work units and
adaptation. To date, however, peer-to-peer distributing them to different peer com-
developers have worked mainly on verti- puters, that execute their corresponding
cally integrated applications, rather than work unit and return the results. Cen-
seeking to define common protocols and tral coordination is invariably required,
standardized infrastructures for interop- mainly for breaking up and distributing
erability. the tasks and collecting the results. Exam-
In summary, one can say that “Grid com- ples of such systems include projects such
puting addresses infrastructure, but not as Seti@home [Sullivan III et al. 1997;
yet failure, while peer-to-peer addresses SetiAtHome 2003], genome@home [Lar-
failure, but not yet infrastructure” [Foster son et al. 2003; GenomeAtHome 2003],
and Iamnitchi 2003]. and others.
In addition to this, the form of sharing Internet Service Support. A number of
initially targeted by peer-to-peer has been different applications based on peer-to-
of limited functionality, providing a global peer infrastructures have emerged for
content distribution and filesharing space supporting a variety of Internet ser-
lacking any form of access control. vices. Examples of such applications
As peer-to-peer technologies move into include peer-to-peer multicast systems
more sophisticated and complex applica- [VanRenesse et al. 2003; Castro et al.
tions, such as structured content distri- 2002], Internet indirection infrastruc-
bution, desktop collaboration, and net- tures [Stoica et al. 2002], and secu-
work computation, it is expected that rity applications, providing protection
there will be a strong convergence be- against denial of service or virus attacks
tween peer-to-peer and Grid computing [Keromytis et al. 2002; Janakiraman et al.
[Foster 2000]. The result will be a new 2003; Vlachos et al. 2004].
class of technologies combining elements Database Systems. Considerable work
of both peer-to-peer and Grid comput- has been done on designing distributed
ing, which will address scalability, self- database systems based on peer-to-peer
adaptation, and failure recovery, while, infrastructures. Bernstein et al. [2002]
at the same time, providing a persis- propose the Local Relational Model
tent and standardized infrastructure for (LRM), in which the set of all data stored
interoperability. in a peer-to-peer network is assumed to
be comprised of inconsistent local rela-
tional databases interconnected by sets
1.3. Classification of Peer-to-Peer
of “acquaintances” that define translation
Applications
rules and semantic dependencies between
Peer-to-peer architectures have been em- them. PIER [Huebsch et al. 2003] is a
ployed for a variety of different application scalable distributed query engine built
categories, which include the following. on top of a peer-to-peer overlay network
Communication and Collaboration. topology that allows relational queries
This category includes systems that to run across thousands of computers.
provide the infrastructure for facilitating The Piazza system [Halevy et al. 2003]
direct, usually real-time, communica- provides an infrastructure for building
tion and collaboration between peer semantic Web [Berners-Lee et al. 2001]
computers. Examples include chat and applications, consisting of nodes that can
instant messaging applications, such as supply either source data (e.g. from a
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
5. A Survey of Content Distribution Technologies 339
relational database), schemas (or ontolo- An examination of current peer-to-peer
gies) or both. Piazza nodes are transitively technologies in this context suggests that
connected by chains of mappings between they can be grouped as follows (with ex-
pairs of nodes, allowing queries to be amples in Tables I and II).
distributed across the Piazza network.
Finally, Edutella [Nejdl et al. 2003] is an —Peer-to-Peer Applications. This category
open-source project that builds on the includes content distribution systems
W3C metadata standard RDF, to provide that are based on peer-to-peer tech-
a metadata infrastructure and querying nology. We attempt to further subdi-
capability for peer-to-peer applications. vide them into the following two groups,
Content Distribution. Most of the cur- based on their application goals and per-
rent peer-to-peer systems fall within the ceived complexity:
category of content distribution, which in- Peer-to-peer “file exchange” systems.
cludes systems and infrastructures de- These systems are targeted towards
signed for the sharing of digital media simple, one-off file exchanges between
and other data between users. Peer-to- peers. They are used for setting up
peer content distribution systems range a network of peers and providing fa-
from relatively simple direct fileshar- cilities for searching and transferring
ing applications, to more sophisticated files between them. These are typically
systems that create a distributed stor- light-weight applications that adopt a
age medium for securely and efficiently best-effort approach without addressing
publishing, organizing, indexing, search- security, availability, and persistence. It
ing, updating, and retrieving data. There is mainly systems in this category that
are numerous such systems and in- are responsible for spawning the repu-
frastructures. Some examples are: the tation (and in some cases notoriety) of
late Napster, Publius [Waldman et al. peer-to-peer technologies.
2000], Gnutella [Gnutella 2003], Kazaa Peer-to-peer content publishing and stor-
[Kazaa 2003], Freenet [Clarke et al. age systems. These systems are targeted
2000], MojoNation [MojoNation 2003], towards creating a distributed storage
Oceanstore [Kubiatowicz et al. 2000], medium in—and through—which users
PAST [Druschel and Rowstron 2001], will be able to publish, store, and dis-
Chord [Stoica et al. 2001], Scan [Chen tribute content in a secure and persis-
et al. 2000], FreeHaven [Dingledine tent manner. Such content is meant to
et al. 2000], Groove [Groove 2003], and be accessible in a controlled manner by
Mnemosyne [Hand and Roscoe 2002]. peers with appropriate privileges. The
This survey will focus on content distri- main focus of such systems is security
bution, one of the most prominent appli- and persistence, and often the aim is
cation areas of peer-to-peer systems. to incorporate provisions for account-
ability, anonymity and censorship resis-
tance, as well as persistent content man-
1.4. Peer-to-Peer Content Distribution agement (updating, removing, version
In its most basic form, a peer-to-peer con- control) facilities.
tent distribution system creates a dis- —Peer-to-Peer Infrastructures. This cate-
tributed storage medium that allows for gory includes peer-to-peer based infras-
the publishing, searching, and retrieval tructures that do not constitute working
of files by members of its network. As applications, but provide peer-to-peer
systems become more sophisticated, non- based services and application frame-
functional features may be provided, in- works. The following infrastructure ser-
cluding provisions for security, anonymity, vices are identified:
fairness, increased scalability and perfor- Routing and location. Any peer-to-peer
mance, as well as resource management content distribution system relies on
and organization capabilities. All of these a network of peers within which re-
will be discussed in the following sections. quests and messages must be routed
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
6. 340 S. Androutsellis-Theotokis and D. Spinellis
Table I. A Classification of Current Peer-to-Peer Systems
(RM: Resource Management; CR: Censorship Resistance; PS: Performance and Scalability; SPE: Security,
Privacy and Encryption; A: Anonymity; RA: Reputation and Accountability; RT: Resource Trading.)
Peer-to-Peer File Exchange Systems
System Brief Description
Napster Distributed file sharing—hybrid decentralized.
Kazaa [2003] Distributed file sharing—partially centralized.
Gnutella [2003] Distributed file sharing—purely decentralized.
Peer-to-Peer Content Publishing and Storage Systems
System Brief Description Main Focus
Scan [Chen et al. 2000] A dynamic, scalable, efficient content distribution PS
network. Provides dynamic content replication.
Publius [Waldman et al. 2000] A censorship-resistant system for publishing RM
content. Static list of servers. Enhanced content
management (update and delete).
Groove [Groove 2003] Internet communications software for direct RM,PS,SPE
real-time peer-to-peer interaction.
FreeHaven [Dingledine et al. 2000] A flexible system for anonymous storage. A,RA
Freenet [Clarke et al. 2000] Distributed anonymous information storage and A,RA
retrieval system.
MojoNation [MojoNation 2003] Distributed file storage. Fairness through the use SPE,RT
of currency mojo.
Oceanstore [Kubiatowicz et al. 2000] An architecture for global scale persistent storage. RM,PS,SPE
Scalable, provides security and access control.
Intermemory [Chen et al. 1999] System of networked computers. Donate storage RT
in exchange for the right to publish data.
Mnemosyne [Hand and Roscoe 2002] Peer-to-peer steganographic storage system. SPE
Provides privacy and plausible deniability.
PAST [Druschel and Rowstron 2001] Large scale persistent peer-to-peer storage utility. PS,SPE
Dagster [Stubblefield and Wallach 2001] A censorship-resistant document publishing CR,SPE
system.
Tangler [Waldman and Mazi 2001] A content publishing system based on document CR,SPE
entanglements.
with efficiency and fault tolerance, and 2. ANALYSIS FRAMEWORK
through which peers and content can
In this survey, we present a description
be efficiently located. Different infras-
and analysis of applications, systems, and
tructures and algorithms have been
infrastructures that are based on peer-
developed to provide such services.
to-peer architectures and aim at either
Anonymity. Peer-to-peer based infras-
offering content distribution solutions, or
tructure systems have been designed
supporting content distribution related
with the explicit aim of providing user
activities.
anonymity.
Our approach is based on:
Reputation Management. In a peer-
to-peer network, there is no central —identifying the feature space of nonfunc-
organization to maintain reputation tional properties and characteristics of
information for users and their behavior. content distribution systems,
Reputation information is, therefore, —determining the way in which the non-
hosted in the various network nodes. In functional properties depend on, and
order for such reputation information to can be affected by, various design fea-
be kept secure, up-to-date, and avail- tures, and
able throughout the network, complex —providing an account, analysis, and
reputation management infrastructures evaluation of the design features and
need to be employed. solutions that have been adopted by
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
7. A Survey of Content Distribution Technologies 341
Table II. A Classification of Current Peer-to-Peer Infrastructures
Infrastructures for routing and location
Chord [Stoica et al. 2001] A scalable peer-to-peer lookup service. Given a key, it
maps the key to a node.
CAN [Ratnasamy et al. 2001] Scalable content addressable network. A distributed
infrastructure that provides hash-table functionality
for mapping file names to their locations.
Pastry [Rowstron and Druschel 2001] Infrastructure for fault-tolerant wide-area location and
routing.
Tapestry [Zhao et al. 2001] Infrastructure for fault-tolerant wide-area location and
routing.
Kademlia [Mayamounkov and Mazieres 2002] A scalable peer-to-peer lookup service based on the
XOR metric.
Infrastructures for anonymity
Anonymous remailer mixnet [Berthold et al. 1998] Infrastructure for anonymous connection.
Onion Routing [Goldschlag et al. 1999] Infrastructure for anonymous connection.
ZeroKnowledge Freedom [Freedom 2003] Infrastructure for anonymous connection.
Tarzan [Freedman et al. 2002] A peer-to-peer decentralized anonymous network layer.
Infrastructures for reputation management
Eigentrust [Kamvar et al. 2003] A Distributed reputation management algorithm.
A Partially distributed reputation management A partially distributed approach based on a
system [Gupta et al. 2003] debit/credit or debit-only scheme.
PeerTrust [Xiong and Liu 2002] A decentralized, feedback-based reputation
management system using satisfaction and number of
interaction metrics.
current peer-to-peer systems, as well as data and processing methods. Unautho-
their shortcomings, potential improve- rized entities cannot change data; ad-
ments, and proposed alternatives. versaries cannot substitute a forged doc-
ument for a requested one.
For the identification of the nonfunc- Privacy and confidentiality. Ensuring
tional characteristics on which our study that data is accessible only to those
is based, we used the work of Shaw and authorized to have access, and that
Garlan [1995], which we adapted to the there is control over what data is col-
field of peer-to-peer content distribution. lected, how it is used, and how it is
The various design features and solu- maintained.
tions that we examine, as well as their
Availability and persistence. Ensuring
relationship to the relevant nonfunctional
that authorized users have access to
characteristics, were assembled through
data and associated assets when re-
a detailed analysis and study of current
quired. For a peer-to-peer content dis-
peer-to-peer content distribution systems
tribution system this often means al-
that are either deployed, researched, or
ways. This property entails stability in
proposed.
the presence of failure, or changing node
The resulting analysis framework is il-
populations.
lustrated in Figure 1. The boxes in the
periphery show the most important at- Scalability. Maintaining the system’s per-
tributes of peer-to-peer content distribu- formance attributes independent of the
tion systems, described as follows. number of nodes or documents in its
network. A dramatic increase in the
Security. Further analyzed in terms of: number of nodes or documents will
Integrity and authenticity. Safeguard- have minimal effect on performance and
ing the accuracy and completeness of availability.
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
8. 342 S. Androutsellis-Theotokis and D. Spinellis
Fig. 1. Illustration of the way in which various design features affect the main characteristics of peer-
to-peer content distribution systems.
Performance. The time required for per- grouping, based on the content itself;
forming the operations allowed by the grouping, based on locality or network
system, typically publication, searching, distance; grouping, based on organiza-
and retrieval of documents. tion ties, as well as others.
Fairness. Ensuring that users offer and The relationships between nonfunc-
consume resources in a fair and bal- tional characteristics are depicted as a
anced manner. May rely on account- UML diagram in Figure 1. The Figure
ability, reputation, and resource trading schematically illustrates the relationship
mechanisms. between the various design features and
Resource Management Capabilities. In the main characteristics of peer-to-peer
their most basic form, peer-to-peer con- content distribution systems.
tent distribution systems allow the pub- The boxes in the center show the design
lishing, searching, and retrieval of docu- decisions that affect these attributes. We
ments. More sophisticated systems may note that these design decisions are mostly
afford more advanced resource manage- independent and orthogonal.
ment capabilities, such as editing or We see, for example, that the perfor-
removal of documents, management of mance of a peer-to-peer content distribu-
storage space, and operations on meta- tion system is affected by the distributed
data. object location and routing mechanisms,
Semantic Grouping of Information. An as well as by the data replication,
area of research that has attracted con- caching, and migration algorithms. Fair-
siderable attention recently is the se- ness, on the other hand, depends on
mantic grouping and organization of the system’s provisions for accountabil-
content and information in peer-to-peer ity and reputation, as well as on any re-
networks. Various grouping schemes source trading mechanisms that may be
are encountered, such as semantic implemented.
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
9. A Survey of Content Distribution Technologies 343
In the following sections of this survey, such networks are often termed “servents”
the different design decisions and features (SERVers+clieENTS).
will be presented and discussed, based on Partially Centralized Architectures. The
an examination of existing peer-to-peer basis is the same as with purely decentral-
content distribution systems, and with ized systems. Some of the nodes, however,
reference to the way in which they affect assume a more important role, acting as
the main attributes of these systems, local central indexes for files shared by lo-
as presented. A relevant presentation cal peers. The way in which these supern-
and discussion of various desired prop- odes are assigned their role by the network
erties of peer-to-peer systems can also varies between different systems. It is im-
be found in Kubiatowics [2003], while portant, however, to note that these su-
an analysis of peer-to-peer systems from pernodes do not constitute single points
the end-user perspective can be found in of failure for a peer-to-peer network, since
Lee [2003]. they are dynamically assigned and, if they
fail, the network will automatically take
3. PEER-TO-PEER DISTRIBUTED OBJECT action to replace them with others.
LOCATION AND ROUTING Hybrid Decentralized Architectures. In
these systems, there is a central server fa-
The operation of any peer-to-peer content cilitating the interaction between peers by
distribution system relies on a network maintaining directories of metadata, de-
of peer computers (nodes), and connec- scribing the shared files stored by the peer
tions (edges) between them. This network nodes. Although the end-to-end interac-
is formed on top of—and independently tion and file exchanges may take place di-
from—the underlying physical computer rectly between two peer nodes, the central
(typically IP) network, and is thus referred servers facilitate this interaction by per-
to as an “overlay” network. The topology, forming the lookups and identifying the
structure, and degree of centralization of nodes storing the files. The terms “peer-
the overlay network, and the routing and through-peer” or “broker mediated” are
location mechanisms it employs for mes- sometimes used for such systems [Kim
sages and content are crucial to the oper- 2001].
ation of the system, as they affect its fault Obviously, in these architectures, there
tolerance, self-maintainability, adaptabil- is a single point of failure (the central
ity to failures, performance, scalability, server). This typically renders them inher-
and security; in short almost all of the sys- ently unscalable and vulnerable to censor-
tem’s attributes as laid out in Section 2 ship, technical failure, or malicious attack.
(see also Figure 1).
Overlay networks can be distinguished
in terms of their centralization and 3.2. Network Structure
structure. By structure, we refer to whether the over-
lay network is created nondeterministi-
3.1. Overlay Network Centralization
cally (ad hoc) as nodes and content are
Although in their purest form peer-to-peer added, or whether its creation is based
overlay networks are supposed to be to- on specific rules. We categorize peer-to-
tally decentralized, in practice this is not peer networks as follows, in terms of their
always true, and systems with various de- structure:
grees of centralization are encountered. Unstructured. The placement of content
Specifically, the following three categories (files) is completely unrelated to the over-
are identified. lay topology.
Purely Decentralized Architectures. All In an unstructured network, content
nodes in the network perform exactly the typically needs to be located. Searching
same tasks, acting both as servers and mechanisms range from brute force meth-
clients, and there is no central coordi- ods, such as flooding the network with
nation of their activities. The nodes of propagating queries in a breadth-first
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
10. 344 S. Androutsellis-Theotokis and D. Spinellis
or depth-first manner until the desired Table III. A classification of Peer-to-Peer Content
content is located, to more sophisticated Distribution Systems and Location and Routing
Infrastructures in Terms of Their Network Structure,
and resource-preserving strategies that With Some Typical Examples
include the use of random walks and rout-
ing indices (discussed in more detail in Centralization
Section 3.3.4). The searching mechanisms
employed in unstructured networks have Hybrid Partial None
obvious implications, particularly in re- Unstructured Napster, Kazaa, Gnutella,
gards to matters of availability, scalability, Publius Mor- FreeHaven
and persistence. pheus,
Unstructured systems are generally Gnutella,
Edutella
more appropriate for accommodating
highly-transient node populations. Some Structured Chord,
representative examples of unstructured Infrastructures CAN,
systems are Napster, Publius [Waldman Tapestry,
Pastry
et al. 2000], Gnutella [Gnutella 2003],
Kazaa [Kazaa 2003], Edutella [Nejdl et al. Structured OceanStore,
2003], FreeHaven [Dingledine et al. 2000], Systems Mnemosyne,
Scan, PAST,
as well as others. Kademlia,
Structured. These have emerged mainly Tarzan
in an attempt to address the scalability
issues that unstructured systems were A category of networks that are in be-
originally faced with. In structured net- tween structured and unstructured are re-
works, the overlay topology is tightly ferred to as loosely structured networks.
controlled and files (or pointers to them) Although the location of content is not
are placed at precisely specified locations. completely specified, it is affected by rout-
These systems essentially provide a map- ing hints. A typical example is Freenet
ping between content (e.g. file identifier) [Clarke et al. 2000; Clake et al. 2002].
and location (e.g. node address), in the Table III summarizes the categories we
form of a distributed routing table, so that outlined, with examples of peer-to-peer
queries can be efficiently routed to the content distribution systems and architec-
node with the desired content [Lv et al. tures. Note that all structured and loosely
2002]. structured systems are inherently purely
Structured systems offer a scalable so- decentralized; form follows function.
lution for exact-match queries, that is, In the following sections, the overlay
queries where the exact identifier of the network topology and operation of differ-
requested data object is known (as com- ent peer-to-peer systems is discussed, fol-
pared to keyword queries). Using exact- lowing the above classification, according
match queries as a substrate for keyword to degree of centralization and network
queries remains an open research problem structure.
for distributed environments [Witten et al.
1999]. 3.3. Unstructured Architectures
A disadvantage of structured systems is 3.3.1. Hybrid Decentralized. Figure 2 il-
that it is hard to maintain the structure lustrates the architecture of a typical hy-
required for efficiently routing messages brid decentralized peer-to-peer system.
in the face of a very transient node popu- Each client computer stores content
lation, in which nodes are joining and leav- (files) shared with the rest of the network.
ing at a high rate [Lv et al. 2002]. All clients connect to a central directory
Typical examples of structured systems server that maintains:
include Chord [Stoica et al. 2001], CAN
[Ratnasamy et al. 2001], PAST [Druschel —A table of registered user connection in-
and Rowstron 2001], Tapestry [Zhao et al. formation (IP address, connection band-
2001] among others. width etc.)
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
11. A Survey of Content Distribution Technologies 345
torious Napster and Publius systems
[Waldman et al. 2000] that rely on a static,
system-wide list of servers. Their architec-
ture does not provide any smooth, decen-
tralized support for adding a new server,
or removing dead or malicious servers. A
comprehensive study of the behavior of
hybrid peer-to-peer systems and a com-
parison of their query characteristics and
their performance in terms of bandwidth
and CPU cycle consumption is presented
in Yang and Garcia-Molina [2001].
It should be noted that systems that
do not fall under the hybrid decentral-
ized category may still use some central
administration server to a limited extent,
Fig. 2. Typical hybrid decentralized peer-to-peer for example, for initial system boot-
architecture. A central directory server maintains strapping (e.g. Mojonation [MojoNation
an index of the metadata for all files in the network.
2003]), or for allowing new users to
—A table listing the files that each user join the network by providing them
holds and shares in the network, along with access to a list of current users
with metadata descriptions of the files (e.g. gnutellahosts.com for the gnutella
(e.g. filename, time of creation, etc.) network).
A computer that wishes to join the 3.3.2. Purely Decentralized. In this sec-
network contacts the central server and tion, we examine the Gnutella network
reports the files it maintains. Client com- [Gnutella 2003], an interesting and rep-
puters send requests for files to the server. resentative member of purely decentral-
The server searches for matches in its in- ized peer-to-peer architectures, due to its
dex, returning a list of users that hold the open architecture, achieved scale, and self-
matching file. The user then opens direct organizing structure. FreeHaven [Dingle-
connections with one or more of the peers dine et al. 2000] is another system using
that hold the requested file, and down- routing and search mechanisms similar to
loads it (see Figure 2). those of Gnutella.
The advantage of hybrid decentralized Like most peer-to-peer systems,
systems is that they are simple to imple- Gnutella builds a virtual overlay net-
ment, and they locate files quickly and ef- work with its own routing mechanisms
ficiently. Their main disadvantage is that [Ripeanu and Foster 2002], allowing its
they are vulnerable to censorship, legal ac- users to share files with other peers.
tion, surveillance, malicious attack, and There is no central coordination of the
technical failure, since the content shared, activities in the network and users connect
or at least descriptions of it, and the abil- to each other directly through a software
ity to access it are controlled by the single application that functions both as a client
institution, company, or user maintain- and a server (users are referred to as a
ing the central server. Furthermore, these servents).
systems are considered inherently unscal- Gnutella uses IP as its underlying net-
able, as there are bound to be limitations work service, while the communication be-
to the size of the server database and its tween servents is specified in a form of
capacity to respond to queries. Large Web application level protocol supporting four
search engines have, however, repeatedly types of messages [Jovanovich et al. 2001]:
provided counterexamples to this notion.
Examples of hybrid decentralized con- Ping. A request for a certain host to an-
tent distribution systems include the no- nounce itself.
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
12. 346 S. Androutsellis-Theotokis and D. Spinellis
Pong. Reply to a Ping message. It con- Once a node receives a QueryHit mes-
tains the IP and port of the responding sage, indicating that the target file has
host and number and size of the files being been identified at a certain node, it ini-
shared. tiates a download by establishing a di-
Query. A search request. It contains a rect connection between the two nodes.
search string and the minimum speed re- Figure 3 illustrates an example of the
quirements of the responding host. Gnutella search mechanism.
Query Hits. Reply to a Query message. Scalability issues in the original purely
It contains the IP, port, and speed of the decentralized systems arose from the fact
responding host, the number of match- that the use of the TTL effectively seg-
ing files found, and their indexed result mented the network into “sub-networks”,
set. imposing on each user a virtual “hori-
zon” beyond which their messages could
After joining the Gnutella network (by not reach [Jovanovic 2000]. Removing the
connecting to nodes found in databases TTL limit, on the other hand, would re-
such as gnutellahosts.com), a node sends sult in the network being swamped with
out a Ping message to any node it is messages.
connected to. The nodes send back a Significant research has been carried
Pong message identifying themselves, and out to address the above issues, and
also propagate the Ping message to their various solutions have been proposed.
neighbors. These will be discussed in more detail in
In order to locate a file in unstructured Section 3.3.4.
systems such as gnutella, nondeterminis-
tic searches are the only option since the 3.3.3. Partially Centralized. Partially cen-
nodes have no way of guessing where (at tralized systems are similar to purely
which nodes) the files may lie. decentralized, but they use the con-
The original Gnutella architecture uses cept of supernodes: nodes that are dy-
a flooding (or broadcast) mechanism to dis- namically assigned the task of servic-
tribute Ping and Query messages: each ing a small subpart of the peer network
Gnutella node forwards the received mes- by indexing and caching files contained
sages to all of its neighbors. The response therein.
messages received are routed back along Peers are automatically elected to be-
the opposite path through which the origi- come supernodes if they have sufficient
nal request arrived. To limit the spread of bandwidth and processing power (al-
messages through the network, each mes- though a configuration parameter may
sage header contains a time-to-live (TTL) allow users to disable this feature)
field. At each hop, the value of this field [FastTrack 2003].
is decremented, and when it reaches zero, Supernodes index the files shared by
the message is dropped. peers connected to them, and proxy search
The above mechanism is implemented requests on behalf of these peers. All
by assigning each message a unique iden- queries are therefore initially directed to
tifier and equipping each host with a dy- supernodes.
namic routing table of message identifiers Two major advantages of partially cen-
and node addresses. Since the response tralized systems are that:
messages contain the same ID as the orig-
inal messages, the host checks its routing —Discovery time is reduced in compari-
table to determine along which link the re- son with purely decentralized systems,
sponse message should be forwarded. In while there still is no unique point of
order to avoid loops, the nodes use the failure. If one or more supernodes go
unique message identifiers to detect and down, the nodes connected to them can
drop duplicate messages, to improve effi- open new connections with other su-
ciency, and preserve network bandwidth pernodes, and the network will continue
[Jovanovic 2000]. to operate.
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
13. A Survey of Content Distribution Technologies 347
Fig. 3. An example of the Gnutella search mechanism. Solid lines between
the nodes represent connections of the Gnutella network. The search orig-
inates at the “requestor” node, for a file maintained by another node. Re-
quest messages are dispatched to all neighboring nodes, and propagated
from node-to-node as shown in the four consecutive steps (a) to (d). When
a node receives the same request message (based on its unique identifier)
multiple times, it replies with a failure message to avoid loops and minimize
traffic. When the file is identified, a success message is returned.
—The inherent heterogeneity of peer-to- mentation (as it is a proprietary system,
peer networks is taken advantage of, there is no detailed documentation on its
and exploited. In a purely decentralized structure and operation). Edutella [Nejdl
network, all of the nodes will be equally et al. 2003] is another partially centralized
(and usually heavily) loaded, regard- architecture.
less of their CPU power, bandwidth, or Yang and Garcia-Molina [2002a, 2002b]
storage capabilities. In partially central- present research addressing the design
ized systems, however, the supernodes of, and searching techniques for, partially
will undertake a large portion of the centralized peer-to-peer networks.
entire network load, while most of the The concept of supernodes has also been
other (so called “normal”) nodes will be proposed in a more recent version of the
very lightly loaded, in comparison (see Gnutella protocol. A mechanism for dy-
also [Lv et al. 2002; Zhichen et al. 2002]). namically selecting supernodes organizes
the Gnutella network into an interconnec-
Kazaa is a typical, widely used instance tion of superpeers (as they are referred to)
of a partially centralized system imple- and client nodes.
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
14. 348 S. Androutsellis-Theotokis and D. Spinellis
When a node with enough CPU power sues by means of:
joins the network, it immediately be-
comes a superpeer and establishes con- —A dynamic topology adaptation protocol
nections with other superpeers, forming a that ensures that most nodes will be at a
flat unstructured network of superpeers. short distance from high capacity nodes.
If it establishes a minimum required num- This protocol, coupled with replicating
ber of connections to client nodes within a pointers to the content of neighbors, en-
specified time, it remains a superpeer. Oth- sures that high capacity nodes will be
erwise, it turns into a regular client node. able to provide answers to a very large
number of queries.
3.3.4. Shortcomings and Evolution of Unstruc- —An active flow control scheme that ex-
tured Architectures. A number of methods ploits network heterogeneity to avoid
have been proposed to overcome the origi- hotspots, and
nal unscalability of unstructured peer-to- —A search protocol that is based on ran-
peer systems. These have been shown to dom walks directed towards high capac-
drastically improve their performance. ity nodes.
Lv et al. [2002] proposed to replace
the original flooding approach by multiple Simulations of Gia were found to in-
parallel random walks, where each node crease overall system capacity by three
chooses a neighbor at random, and propa- to five orders of magnitude. Similar ap-
gates the request only to it. The use of ran- proaches, based on taking advantage of
dom walks, combined with proactive object the underlying network heterogeneity, are
replication (discussed in Section 4), was described in Lv et al. [2002] and Zhichen
found to significantly improve the perfor- et al. [2002].
mance of the system as measured by query In another approach, Crespo and
resolution time (in terms of numbers of Garcia-Molina [2002] use routing indices
hops), per-node query load, and message to address the searching and scalability
traffic generated. Proactive data replica- issues. Routing indices are tables of infor-
tion schemes are further examined in Co- mation about other nodes, stored within
hen and Shenker [2001]. each node. They provide a list of neighbors
Yang and Garcia-Molina [2002b] sug- that are most likely to be “in the direction”
gested the use of more sophisticated of the content corresponding to the query.
broadcast policies, selecting which neigh- These tables contain information about
bors to forward search queries to based on the total number of files maintained by
their past history, as well as the use of local nearby nodes, as well as the number
indices: data structures where each node of files corresponding to various topics.
maintains an index of the data stored at Three different variations of the approach
nodes located within a radius from itself. are presented, and simulations are shown
A similar solution to the information re- to improve performance by 1-2 orders
trieval problem is proposed by Kalogeraki of magnitude with respect to flooding
et al. [2002] in the form of an Intelli- techniques.
gent Search Mechanism built on top of Finally, the connectivity properties and
a Modified Random BFS Search Mecha- reliability of unstructured peer-to-peer
nism. Each peer forwards queries to a sub- networks such as Gnutella have been
set of its neighbors, selecting them based studied in Ripeanu and Foster [2002]. In
on a profile mechanism that maintains in- particular, emphasis was placed on the
formation about their performance in re- fact that peer-to-peer networks exhibit
cent queries. Neighbours are ranked ac- the properties of power-law networks,
cording to their profiles and queries are in which the number of nodes with L
forwarded selectively to the most appro- links is proportional to L−k , where k is
priate ones only. a network dependent constant. In other
Chawathe et al. [2003] presented the words, most nodes have few links (thus, a
Gia System, which addresses the above is- large fraction of them can be taken away
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
15. A Survey of Content Distribution Technologies 349
without seriously damaging the network —CAN is a system using an n-dimensional
connectivity), while there are a few highly- Cartesian coordinate space to imple-
connected nodes which, if taken away, are ment the distributed location and rout-
likely to cause the whole network to be ing table, whereby each node is respon-
broken down in disconnected pieces. One sible for a zone in the coordinate space.
implication of this is that such networks —Tapestry (and the similar Pastry and
are robust when facing random node Kademlia) are based on the plaxton
attacks, yet vulnerable to well-planned mesh data structure, which maintains
attacks. The topology mismatch between pointers to nodes in the network whose
the Gnutella network and the underlying IDs match the elements of a tree-like
physical network infrastructure was also structure of ID prefixes up to a digit
documented in Ripeanu and Foster [2002]. position.
Tsoumakos and Roussopoulos [2003]
present a more comprehensive analysis 3.4.1. Freenet—A Loosely Structured Sys-
and comparison of the above methods. tem. The defining characteristic of loosely
Overall, unstructured peer-to-peer con- structured systems is that the nodes of the
tent distribution systems might be the peer-to-peer network can produce an esti-
preferred choice for applications where the mate (not with certainty) of which node is
following assumptions hold: most likely to store certain content. This
affords them the possibility of avoiding
—keyword searching is the common oper- blindly broadcasting request messages to
ation, all (or a random subset) of their neighbors.
—most content is typically replicated at a Instead, they use a chain mode propaga-
fair fraction of participating sites, tion approach, where each node makes a
—the node population is highly transient, local decision about which node to send the
—users will accept a best-effort content re- request message to next.
trieval approach, and Freenet [Clarke et al. 2000] is a typical,
purely decentralized loosely-structured
—the network size is not so large as to in-
content distribution system. It operates
cur scalability problems [Lv et al. 2002]
as a self-organizing peer-to-peer network,
(The last issue is alleviated through
pooling unused disk space in peer com-
the various methods described in this
puters to create a collaborative virtual
Section).
file system. Important features of Freenet
include its focus on security, publisher
3.4. Structured Architectures anonymity, deniability, and data replica-
The various structured content distribu- tion for availability and performance.
tion systems and infrastructures employ Files in Freenet are identified by unique
different mechanisms for routing mes- binary keys. Three types of keys are sup-
sages and locating data. Four of the most ported, the simplest is based on applying
interesting and representative mecha- a hash function on a short descriptive text
nisms and their corresponding systems string that accompanies each file as it is
are examined in the following sections. stored in the network by its original owner.
Each Freenet node maintains its own lo-
—Freenet is a loosely structured system cal data store, that it makes available to
that uses file and node identifier similar- the network for reading and writing, as
ity to produce an estimate of where a file well as a dynamic routing table contain-
may be located, and a chain mode propa- ing the addresses of other nodes and the
gation approach to forward queries from files they are thought to hold. To search
node-to-node. for a file, the user sends a request message
—Chord is a system whose nodes maintain specifying the key and a timeout (hops-to-
a distributed routing table in the form of live) value.
an identifier circle on which all nodes are Freenet uses the following types of mes-
mapped, and an associated finger table. sages, which all include the node identifier
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
16. 350 S. Androutsellis-Theotokis and D. Spinellis
(for loop detection), a hops-to-live value If a node receives a request for a locally-
(similar to the Gnutella TTL, see Sec- stored file, the search stops and the data
tion 3.3.2), and the source and destination is forwarded back to the requestor.
node identifiers: If the node does not store the file that the
requestor is looking for, it forwards the re-
Data insert. A node inserts new data
quest to its neighbor that is most likely to
in the network. A key and the actual data
have the file, by searching for the file key
(file) are included.
in its local routing table that is closest to
Data request. A request for a certain
the one requested. The messages, there-
file. The key of the file requested is also
fore, form a chain, as they propagate from
included.
node-to-node. To avoid huge chains, mes-
Data reply. A reply initiated when the
sages are deleted after passing through
requested file is located. The actual file is
a certain number of nodes, based on the
also included in the reply message.
hops-to-live value they carry. Nodes also
Data failed. A failure to locate a file.
store the ID and other information of the
The location (node) of the failure and the
requests they have seen, in order to handle
reason are also included.
“data reply” and “data failed” messages.
New nodes join the Freenet network If a node receives a backtracking “data
by first discovering the address of one failed” message from a downstream node,
or more existing nodes, and then start- it selects the next best node from its rout-
ing to send Data Insert messages. To in- ing stack and forwards the request to it.
sert a new file in the network, the node If all nodes in the routing table have been
first calculates a binary key for the file, explored in this way and failed, it sends
and then sends a data insert message to back a “data failed” message to the node
itself. Any node that receives the insert from which it originally received the data
message, first checks to see if the key is request message.
already taken. If the key is not found, the If the requested file is eventually found
node looks up the closest key (in terms at a certain node, a reply is passed back
of lexicographic distance) in its routing through each node that forwarded the re-
table, and forwards the insert message quest to the original node that started
to the corresponding node. By this mech- the chain. This data reply message will
anism, newly inserted files are placed include the actual data, which is cached
at nodes possessing files with similar in all intermediate nodes for future re-
keys. quests. A subsequent request with the
This continues as long as the hops-to- same key will be served immediately with
live limit is not reached. In this way, more the cached data. A request for a similar
than one node will store the new file. At key will be forwarded to the node that pre-
the same time, all the participating nodes viously provided the data.
will update their routing tables with the To address the problem of obtaining
new information (this is the mechanism the key that corresponds to a specific file,
through which the new nodes announce Freenet recommends the use of a special
their presence to the rest of the network). class of lightweight files called “indirect
If the hops-to-live limit is reached without files”. When a real file is inserted, the au-
any key collision, an “all clear” result thor also inserts a number of indirect files
will be propagated back to the original that are named according to search key-
inserter, informing that the insert was words and contain pointers to the real file.
successful. These indirect files differ from normal files
If the key is found to be taken, the node in that multiple files with the same key
returns the preexisting file as if a request (i.e. search keyword) are permitted to ex-
were made for it. In this way, malicious ist, and requests for such keys keep going
attempts to supplant existing files by in- until a specified number of results is ac-
serting junk will result in the existing files cumulated, instead of stopping at the first
being spread further. file found. The problem of managing the
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
17. A Survey of Content Distribution Technologies 351
Fig. 4. The use of indirect files in Freenet.
The diagram illustrates a regular file with key
“A652D17D88”, which is supposed to contain a tu-
torial about photography. The author chose to in-
sert the file itself, and then a set of indirect files
(marked with an i), named according to search key- Fig. 5. A Chord identifier circle consisting of the
words she considered relevant. The indirect files three nodes 0,1 and 3. In this example, key 1 is lo-
are distributed among the nodes of the network; cated at node 1, key 2 at node 3, and key 6 at node 0.
they do not contain the actual document, simply a
“pointer” to the location of the regular file contain-
ing the document. [Karger et al. 1997]. All node identifiers
are ordered in an “identifier circle” modulo
large volume of such indirect files remains 2m (Figure 5 shows an identifier circle with
open. Figure 4 illustrates the use of indi- m = 3). Key k is assigned to the first node
rect files. whose identifier is equal to, or follows k, in
The following properties of Freenet are the identifier space. This node is called the
a result of its routing and location algo- successor node of key k. The use of consis-
rithms: tent hashing tends to balance load, as each
node receives roughly the same number of
—Nodes tend to specialize in searching keys.
for similar keys over time, as they get The only routing information required
queries from other nodes for similar is for each node to be aware of its succes-
keys. sor node on the circle. Queries for a given
—Nodes store similar keys over time, due key are passed around the circle via these
to the caching of files as a result of suc- successor pointers until a node that con-
cessful queries. tains the key is encountered. This is the
—Similarity of keys does not reflect simi- node the query maps to.
larity of files. When a new node n joins the network,
—Routing does not reflect the underlying certain keys previously assigned to n’s suc-
network topology. cessor will become assigned to n. When
node n leaves the network, all keys as-
3.4.2. Chord. Chord [Stoica et al. 2001] signed to it will be reassigned to its suc-
is a peer-to-peer routing and location in- cessor. These are the only changes in key
frastructure that performs a mapping of assignments that need to take place in or-
file identifiers onto node identifiers. Data der to maintain load balance.
location can be implemented on top of As discussed, only one data element
Chord by identifying data items (files) per node needs to be correct for Chord to
with keys and storing the (key, data item) guarantee correct (though slow) routing of
pairs at the node that the keys map to. queries. Performance degrades gracefully
In Chord, nodes are also identified by when routing information becomes out of
keys. The keys are assigned both to files date due to nodes joining and leaving the
and nodes by means of a deterministic system, and availability remains high only
function, a variant of consistent hashing as long as nodes fail independently. Since
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
18. 352 S. Androutsellis-Theotokis and D. Spinellis
Fig. 6. CAN: (a) Example 2-d [0, 1] × [0, 1] coordinate space partitioned
between 5 CAN nodes; (b) Example 2-d space after node F joins.
the overlay topology is not based on the by supporting the insertion, lookup, and
underlying physical IP network topology, a deletion of (key,value) pairs in the table.
single failure in the IP network may man- Each individual node of the CAN net-
ifest itself as multiple, scattered link fail- work stores a part (referred to as a “zone”)
ures in the overlay [Saroiu et al. 2002]. of the hash table, as well as information
To increase the efficiency of the loca- about a small number of adjacent zones
tion mechanism described previously, that in the table. Requests to insert, lookup, or
may, in the worst case, require traversing delete a particular key are routed via in-
all N nodes to find a certain key, Chord termediate zones to the node that main-
maintains additional routing information, tains the zone containing the key.
in the form of a “finger table”. In this table, CAN uses a virtual d -dimensional
each entry i points to the successor of node Cartesian coordinate space (see Figure 6)
n + 2i. For a node n to perform a lookup to store (key K ,value V ) pairs. The zone
for key k, the finger table is consulted to of the hash table that a node is respon-
identify the highest node n whose ID is sible for corresponds to a segment of this
between n and k. If such a node exists, the coordinate space. Any key K is, therefore,
lookup is repeated starting from n . Oth- deterministically mapped onto a point P
erwise, the successor of n is returned. Us- in the coordinate space. The (K , V ) pair is
ing the finger table, both the amount of then stored at the node that is responsible
routing information maintained by each for the zone within which point P lies. For
node and the time required for resolving example, in the case of Figure 6(a), a key
lookups are O(logN ) for an N -node sys- that maps to coordinate (0.1,0.2) would be
tem in the steady state. stored at the node responsible for zone B.
Achord [Hazel and Wiley 2002] is pro- To retrieve the entry corresponding to
posed as a variant of Chord that pro- K , any node can apply the same determin-
vides censorship resistance by limiting istic function to map K to P , and then re-
each node’s knowledge of the network in trieve the corresponding value V from the
ways similar to Freenet (see Section 3.4.1). node covering P . Unless P happens to lie
in the requesting node’s zone, the request
3.4.3. CAN. The CAN (“Content must be routed from node-to-node until it
Addressable Network”) [Ratnasamy reaches the node covering P .
et al. 2001] is essentially a distributed, CAN nodes maintain a routing table
Internet-scale hash table that maps file containing the IP addresses of nodes that
names to their location in the network, hold zones adjoining their own, to enable
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
19. A Survey of Content Distribution Technologies 353
routing between arbitrary points in space. advanced routing metrics, such as connec-
Intuitively, routing in CAN works by fol- tion latency and underlying IP topology,
lowing the straight line path through the alongside the Cartesian distance between
Cartesian space from source to destination source and destination; allowing multiple
coordinates. For example, in Figure 6(a), a nodes to share the same zone, mapping the
request from node A for a key mapping to same key onto different points; and the ap-
point p would be routed though nodes A, plication of caching and replication tech-
B, E, along the straight line represented niques [Ratnasamy et al. 2001].
by the arrow.
A new node that joins the CAN system is
allocated its own portion of the coordinate 3.4.4. Tapestry. Tapestry [Zhao et al.
space by splitting the allocated zone of an 2001] supports the location of objects and
existing node in half, as follows: the routing of messages to objects (or the
closest copy of them, if more than one copy
(1) The new node identifies a node al- exist in the network) in a distributed, self-
ready in CAN network, using a boot- administering, and fault-tolerant manner,
strap mechanism as described in Fran- offering system-wide stability by bypass-
cis [2000]. ing failed routes and nodes, and rapidly
(2) Using the CAN routing mechanism, adapting communication topologies to cir-
the node randomly chooses a point P cumstances.
in the coordinate space and sends a The topology of the network is self-
JOIN request to the node covering P . organizing as nodes come and go, and net-
The zone is then split, and half of it is work latencies vary. The routing and loca-
assigned to the new node. tion information is distributed among the
(3) The new node builds its routing table network nodes; the topology’s consistency
with the IP addresses of its new neigh- is checked on-the-fly, and if it is lost due to
bors, and the neighbors of the split failures or destroyed, it is easily rebuilt or
zone are also notified to update their refreshed.
routing tables to include the new node. Tapestry is based on the location and
routing mechanisms introduced by Plax-
When nodes gracefully leave CAN, the ton, Rajamaran and Richa [1997], in which
zones they occupy and the associated they present the Plaxton mesh, a dis-
hash table entries are explicitly handed tributed data structure that allows nodes
over to one of their neighbors. Under to locate objects and route messages to
normal conditions, a node sends periodic them across an arbitrarily-sized overlay
update messages to each of its neighbors network, while using routing maps of
reporting its zone coordinates, its list of small and constant size. In the original
neighbors, and their zone coordinates. If Plaxton mesh, the nodes can take on the
there is prolonged absence of such an up- role of servers (where objects are stored),
date message, the neighbor nodes realize routers (which forward messages), and
there has been a failure, and initiate a con- clients (origins of requests).
trolled takeover mechanism. If many of Each node maintains a neighbor map,
the neighbors of a failed node also fail, an as shown in the example in Table IV. The
expanding ring search mechanism is ini- neighbor map has multiple levels, each
tiated by one of the neighboring nodes, to level l containing pointers to nodes whose
identify any functioning nodes outside the ID must be matched with l digits (the x’s
failure region. represent wildcards). Each entry in the
A list of design improvements is pro- neighbor map corresponds to a pointer to
posed over the basic CAN design described the closest node in the network whose ID
including the use of multi-dimensional matches the number in the neighbor map,
coordinate space, or multiple coordinate up to a digit position.
spaces, for improving network latency and For example, the 5th entry for the 3rd
fault tolerance; the employment of more level for node 67493 points to the node
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
20. 354 S. Androutsellis-Theotokis and D. Spinellis
Table IV. The Neighbor Map Held by Tapestry Node
With ID 67493
Level 5 Level 4 Level 3 Level 2 Level 1
Entry 0 07493 x0493 xx093 xxx03 xxxx0
Entry 1 17493 x1493 xx193 xxx13 xxxx1
Entry 2 27493 x2493 xx293 xxx23 xxxx2
Entry 3 37493 x3493 xx393 xxx33 xxxx3
Entry 4 47493 x4493 xx493 xxx43 xxxx4
Entry 5 57493 x5493 xx593 xxx53 xxxx5
Entry 6 67493 x6493 xx693 xxx63 xxxx6
Entry 7 77493 x7493 xx793 xxx73 xxxx7
Entry 8 87493 x8493 xx893 xxx83 xxxx8
Entry 9 97493 x9493 xx993 xxx93 xxxx9
Each entry in this table corresponds to a pointer to
another node.
closest to 67493 in network distance whose
ID ends in ..593. Table IV shows an exam-
Fig. 7. Tapestry: Plaxton mesh routing ex-
ple neighbor map maintained by a node ample, showing the path taken by a message
with ID 67493. originating from node 67493, and destined for
Messages are, therefore, incrementally node 34567 in a Plaxton mesh, using decimal
routed to the destination node digit-by- digits of length 5.
digit, from the right to the left. Figure 7
shows an example path taken by a mes- required for assigning and identifying root
sage from node with I D = 67493, to node nodes, and (2) the vulnerability of the root
I D = 34567. The digits are resolved right nodes.
to left as follows: The Plaxton mesh assumes a static node
population. Tapestry extends its design to
xxxx7 → xxx67 → xx567 → x4567 adapt it to the transient populations of
peer-to-peer networks and provide adapt-
→ 34567 ability, fault tolerance, as well as vari-
ous optimizations described in Zhao et al.
The Plaxton mesh uses a root node for [2001] and outlined below:
each object, that serves to provide a guar-
anteed node from which the object can —Each node additionally maintains a list
be located. When an object o is inserted of back-pointers, which point to nodes
in the network and stored at node ns , where it is referred to as a neighbor.
a root node nr is assigned to it by us- These are used in dynamic node inser-
ing a globally consistent deterministic al- tion algorithms to generate the appro-
gorithm. A message is then routed from priate neighbor maps for new nodes. Dy-
ns to nr , storing data in the form of a namic algorithms are employed for node
mapping (object id o, storer id ns ) at all insertion, populating neighbor maps,
nodes along the way. During a location and notifying neighbors of new node in-
query, messages destined for o are ini- sertions.
tially routed towards nr , until a node is —The concept of distance between nodes
encountered containing the (o, ns ) location becomes semantically more flexible, and
mapping. locations of more than one replica of
The Plaxton mesh offers: (1) simple an object are stored, allowing the ap-
fault-handling by its potential to route plication architecture to define how the
around a single link or node by choos- “closest” node will be interpreted. For
ing a node with a similar suffix, and (2) example, in the Oceanstore architecture
scalability (with the only bottleneck exist- [Kubiatowicz et al. 2000], a “freshness”
ing at the root nodes). Its limitations in- metric is incorporated in the concept
clude: (1) the need for global knowledge of distance, that is taken into account
ACM Computing Surveys, Vol. 36, No. 4, December 2004.
21. A Survey of Content Distribution Technologies 355
when finding the closest replica of a doc- availability to adapt to regional outages
ument. and denial of service attacks.
—The use of soft-state to maintain cached Pastry [Rowstron and Druschel 2001]
content, based on the announce/listen is a scheme very similar to Tapestry,
approach [Deering 1998], is adopted by differing mainly in its approach to
Tapestry to detect, circumvent, and re- achieving network locality and object
cover from failures in routing or ob- replication. It is employed by the PAST
ject location. Caches are periodically [Druschel and Rowstron 2001] large-
updated by refreshment messages, or scale persistent peer-to-peer storage
purged if no such messages are re- utility.
ceived. Additionally, the neighbor map Finally Kademlia [Mayamounkov and
is extended to maintain two backup Mazieres 2002] is proposed as an improved
neighbors, in addition to the closest (pri- XOR-topology-based routing algorithm
mary) neighbor. Furthermore, to avoid similar to Tapestry and Pastry, focusing
costly reinsertions of nodes after fail- on consistency and performance. It intro-
ures, when a node realizes that a neigh- duces the use of a concurrence parameter,
bor is unreachable, instead of removing that lets users trade bandwidth for better
its pointer, it temporarily marks it as in- latency and fault recovery.
valid in the hope that the failure will be
repaired, and, in the meantime, routes 3.4.5. Comparison, Shortcomings and Evo-
messages through alternative paths. lution of Structured Architectures. In the
—To avoid the single point of failure that previous sections, we have presented
root nodes constitute, Tapestry assigns characteristic examples of structured
multiple roots to each object. This en- peer-to-peer object location and routing
ables a trade-off between reliability and systems, and their main characteristics.
redundancy. A distributed algorithm A comparison of these (and similar) sys-
called surrogate routing is employed to tems can be based on the size of the
compute a unique root node for an object routing information maintained by each
in a globally consistent fashion, given node (the routing table), their search
the nonstatic set of nodes in the network and retrieval performance (measured in
—A set of optimizations improve per- number of hops), and the flexibility with
formance by adapting to environment which they can adapt to changing network
changes. Tapestry nodes tune their topologies.
neighbor pointers by running refresher It turns out that in their basic de-
threads that update network latency sign, most of these systems are equiva-
values. Algorithms are implemented to lent in terms of routing table space cost,
detect query hotspots and offer sugges- which is O(log N ), where N is the size of
tions as to where additional copies of network. Similarly, their performance is
objects can be placed to significantly im- again mostly O(log N ), with the exception
prove query response time. A “hotspot of CAN, where the performance is given by
1
cache” is also maintained at each node. O( d N d ), d being the number of employed
4
dimensions.
Tapestry is used by several systems Maintaining the routing table in the
such as Oceanstore [Kubiatowicz et al. face of transient node populations is rela-
2000; Rhea et al. 2001], Mnemosyne tively costly for all systems, perhaps more
[Hand and Roscoe 2002], and Scan [Chen for Chord, as nodes joining or leaving in-
et al. 2000]. duce changes to all other nodes, and less
Oceanstore further enhances the per- so in Kademlia which maintains a more
formance and fault tolerance of Tapestry flexible routing table. Liben-Nowell et al.
by applying an additional “introspec- [2002a] introduce the notion of the “half-
tion layer”, where nodes monitor usage life” of a system as an approach to this
patterns, network activity, and resource issue.
ACM Computing Surveys, Vol. 36, No. 4, December 2004.