Nubilum: Resource Management System for Distributed Clouds
Pós-Graduação em Ciência da Computação
“Nubilum: Resource Management System for
Distributed Clouds”
By
Glauco Estácio Gonçalves
Doctoral Thesis
Universidade Federal de Pernambuco
posgraduacao@cin.ufpe.br
www.cin.ufpe.br/~posgraduacao
RECIFE, 03/2012
UNIVERSIDADE FEDERAL DE PERNAMBUCO
CENTRO DE INFORMÁTICA
PÓS-GRADUAÇÃO EM CIÊNCIA DA COMPUTAÇÃO
GLAUCO ESTÁCIO GONÇALVES
“Nubilum: Resource Management System for Distributed Clouds”
THIS WORK WAS PRESENTED TO THE GRADUATE PROGRAM IN
COMPUTER SCIENCE OF THE CENTRO DE INFORMÁTICA OF THE
UNIVERSIDADE FEDERAL DE PERNAMBUCO AS A PARTIAL
REQUIREMENT FOR OBTAINING THE DEGREE OF DOCTOR IN
COMPUTER SCIENCE.
ADVISOR: Dr. JUDITH KELNER
CO-ADVISOR: Dr. DJAMEL SADOK
RECIFE, MARCH 2012
Doctoral thesis presented by Glauco Estácio Gonçalves to the Graduate Program in
Computer Science of the Centro de Informática, Universidade Federal de Pernambuco,
under the title “Nubilum: Resource Management System for Distributed Clouds”,
advised by Prof. Judith Kelner and approved by the Examination Committee formed
by the following professors:
___________________________________________________________
Prof. Paulo Romero Martins Maciel
Centro de Informática / UFPE
__________________________________________________________
Prof. Stênio Flávio de Lacerda Fernandes
Centro de Informática / UFPE
____________________________________________________________
Prof. Kelvin Lopes Dias
Centro de Informática / UFPE
_________________________________________________________
Prof. José Neuman de Souza
Departamento de Computação / UFC
___________________________________________________________
Profa. Rossana Maria de Castro Andrade
Departamento de Computação / UFC
Seen and approved for printing.
Recife, March 12, 2012.
___________________________________________________
Prof. Nelson Souto Rosa
Coordinator of the Graduate Program in Computer Science of the
Centro de Informática, Universidade Federal de Pernambuco.
iv
Acknowledgments
I would like to express my gratitude to God, cause of all things and of my own
existence, and to the Blessed Virgin Mary, to whom I appealed many times in prayer and
who always answered.
I would like to thank my advisor Dr. Judith Kelner and my co-advisor Dr. Djamel
Sadok, whose expertise and patience added considerably to my doctoral experience. Thank
you for trusting in my capacity to conduct my doctorate at GPRT (Networks and
Telecommunications Research Group).
I am indebted to all the people at GPRT for their invaluable help with this work. A
very special thanks goes to Patrícia, Marcelo, and André Vítor, who gave valuable
comments over the course of my PhD.
I must also acknowledge my committee members, Dr. Jose Neuman, Dr. Otto Duarte,
Dr. Rossana Andrade, Dr. Stênio Fernandes, Dr. Kelvin Lopes, and Dr. Paulo Maciel, for
reviewing my proposal and thesis and offering helpful comments to improve my work.
I would like to thank my wife Danielle, whose prayers, patience, and love gave me
the strength needed to finish this work. A special thanks to my children, João Lucas
and Catarina; they are gifts from God that make life delightful.
Finally, I would like to thank my parents, João and Fátima, and my sisters, Cynara and
Karine, for their love. Their blessings have always been with me as I pursued my doctoral
research.
v
Abstract
The current infrastructure of Cloud Computing providers is composed of networking and
computational resources located in large datacenters hosting as many as hundreds of
thousands of pieces of diverse IT equipment. In such a scenario, there are several
management challenges related to energy, failure handling, operational management, and
temperature control. Moreover, the geographical distance between resources and final users
is a source of delay when accessing the services. An alternative to such challenges is the
creation of Distributed Clouds (D-Clouds), with resources geographically distributed along
a network infrastructure with broad coverage.
Providing resources in such a distributed scenario is not a trivial task since, beyond
processing and storage resources, network resources must be taken into consideration,
offering users a connectivity service for data transportation (also called Network as a
Service, or NaaS). Thereby, the allocation of resources must consider the virtualization of
both servers and network devices. Furthermore, resource management must cover all steps,
from the initial discovery of the adequate resources to meet developers' demands to their
final delivery to the users.
Considering these resource management challenges in D-Clouds, this Thesis proposes
Nubilum, a system for resource management on D-Clouds that takes the geo-locality of
resources and NaaS aspects into account. Through its processes and algorithms, Nubilum
offers solutions for the discovery, monitoring, control, and allocation of resources in
D-Clouds in order to ensure the adequate functioning of the D-Cloud while meeting
developers' requirements. Nubilum and its underlying technologies and building blocks are
described, and its allocation algorithms are evaluated to verify their efficacy and efficiency.
Keywords: cloud computing, resource management mechanisms, network virtualization.
vi
Resumo
Atualmente, a infraestrutura dos provedores de computação em Nuvem é composta por
recursos de rede e de computação localizados em grandes datacenters com centenas de
milhares de equipamentos. Neste cenário, encontram-se diversos desafios quanto à gerência
de energia e ao controle de temperatura; além disso, a distância geográfica entre os recursos
e os usuários é fonte de atraso no acesso aos serviços. Uma alternativa a tais desafios é o
uso de Nuvens Distribuídas (Distributed Clouds – D-Clouds), com recursos distribuídos
geograficamente ao longo de uma infraestrutura de rede com cobertura abrangente.
Prover recursos em tal cenário distribuído não é uma tarefa trivial, pois, além de
recursos computacionais e de armazenamento, deve-se considerar recursos de rede os quais
são oferecidos aos usuários da nuvem como um serviço de conectividade para transporte de
dados (também chamado Network as a Service – NaaS). Desse modo, o processo de alocação
deve considerar a virtualização de ambos, servidores e elementos de rede. Além disso, a
gerência de recursos deve considerar desde a descoberta dos recursos adequados para
atender as demandas dos usuários até a manutenção da qualidade de serviço na sua entrega
final.
Considerando estes desafios em gerência de recursos em D-Clouds, este trabalho
propõe Nubilum: um sistema para gerência de recursos em D-Clouds que considera aspectos
de geo-localidade e NaaS. Por meio de seus processos e algoritmos, Nubilum oferece
soluções para descoberta, monitoramento, controle e alocação de recursos em D-Clouds de
forma a garantir o bom funcionamento da D-Cloud, além de atender os requisitos dos
desenvolvedores. As diversas partes e tecnologias de Nubilum são descritas em detalhes e
suas funções delineadas. Ao final, os algoritmos de alocação do sistema são também
avaliados de modo a verificar sua eficácia e eficiência.
Palavras-chave: computação em nuvem, mecanismos de alocação de recursos, virtualização
de redes.
vii
Contents
Abstract v
Resumo vi
Abbreviations and Acronyms xii
1 Introduction 1
1.1 Motivation............................................................................................................................................. 2
1.2 Objectives ............................................................................................................................................. 4
1.3 Organization of the Thesis................................................................................................................. 4
2 Cloud Computing 6
2.1 What is Cloud Computing?................................................................................................................ 6
2.2 Agents involved in Cloud Computing.............................................................................................. 7
2.3 Classification of Cloud Providers...................................................................................................... 8
2.3.1 Classification according to the intended audience..................................................................................8
2.3.2 Classification according to the service type.............................................................................................8
2.3.3 Classification according to programmability.........................................................................................10
2.4 Mediation System............................................................................................................................... 11
2.5 Groundwork Technologies.............................................................................................................. 12
2.5.1 Service-Oriented Computing...................................................................................................................12
2.5.2 Server Virtualization..................................................................................................................................12
2.5.3 MapReduce Framework............................................................................................................................13
2.5.4 Datacenters.................................................................................................................................................14
3 Distributed Cloud Computing 15
3.1 Definitions.......................................................................................................................................... 15
3.2 Research Challenges inherent to Resource Management ............................................................ 18
3.2.1 Resource Modeling....................................................................................................................................18
3.2.2 Resource Offering and Treatment..........................................................................................................20
3.2.3 Resource Discovery and Monitoring......................................................................................................22
3.2.4 Resource Selection and Optimization....................................................................................................23
3.2.5 Summary......................................................................................................................................................27
4 The Nubilum System 28
4.1 Design Rationale................................................................................................................................ 28
4.1.1 Programmability.........................................................................................................................................28
4.1.2 Self-optimization........................................................................................................................................29
4.1.3 Existing standards adoption.....................................................................................................................29
4.2 Nubilum’s conceptual view.............................................................................................................. 29
4.2.1 Decision plane............................................................................................................................................30
4.2.2 Management plane.....................................................................................................................................31
4.2.3 Infrastructure plane...................................................................................................................................32
4.3 Nubilum’s functional components.................................................................................................. 32
4.3.1 Allocator......................................................................................................................................................33
4.3.2 Manager.......................................................................................................................................................34
viii
4.3.3 Worker.........................................................................................................................................................35
4.3.4 Network Devices.......................................................................................................................................36
4.3.5 Storage System ...........................................................................................................................................37
4.4 Processes............................................................................................................................................. 37
4.4.1 Initialization processes..............................................................................................................................37
4.4.2 Discovery and monitoring processes......................................................................................................38
4.4.3 Resource allocation processes..................................................................................................................39
4.5 Related projects.................................................................................................................................. 40
5 Control Plane 43
5.1 The Cloud Modeling Language ....................................................................................................... 43
5.1.1 CloudML Schemas.....................................................................................................................................45
5.1.2 A CloudML usage example......................................................................................................................52
5.1.3 Comparison and discussion .....................................................................................................................56
5.2 Communication interfaces and protocols...................................................................................... 57
5.2.1 REST Interfaces.........................................................................................................................................57
5.2.2 Network Virtualization with Openflow.................................................................................................63
5.3 Control Plane Evaluation ................................................................................................................. 65
6 Resource Allocation Strategies 68
6.1 Manager Positioning Problem ......................................................................................................... 68
6.2 Virtual Network Allocation.............................................................................................................. 70
6.2.1 Problem definition and modeling ...........................................................................................................72
6.2.2 Allocating virtual nodes............................................................................................................................74
6.2.3 Allocating virtual links...............................................................................................................................75
6.2.4 Evaluation...................................................................................................................................................76
6.3 Virtual Network Creation................................................................................................................. 81
6.3.1 Minimum length Steiner tree algorithms ...............................................................................................82
6.3.2 Evaluation...................................................................................................................................................86
6.4 Discussion........................................................................................................................................... 89
7 Conclusion 91
7.1 Contributions ..................................................................................................................................... 92
7.2 Publications ........................................................................................................................................ 93
7.3 Future Work ....................................................................................................................................... 94
References 96
ix
List of Figures
Figure 1 Agents in a typical Cloud Computing scenario (from [24]) ..................................................7
Figure 2 Classification of Cloud types (from [71]).................................................................................9
Figure 3 Components of an Archetypal Cloud Mediation System (adapted from [24]) ................11
Figure 4 Comparison between (a) a current Cloud and (b) a D-Cloud............................................16
Figure 5 ISP-based D-Cloud example ...................................................................................................17
Figure 6 Nubilum’s planes and modules...............................................................................................30
Figure 7 Functional components of Nubilum......................................................................................33
Figure 8 Schematic diagram of Allocator’s modules and relationships with other components..33
Figure 9 Schematic diagram of Manager’s modules and relationships with other components...34
Figure 10 Schematic diagram of Worker modules and relationships with the server system........35
Figure 11 Link discovery process using LLDP and Openflow ..........................................................38
Figure 12 Sequence diagram of the Resource Request process for a developer..............................39
Figure 13 Integration of different descriptions using CloudML........................................................44
Figure 14 Basic status type used in the composition of other types..................................................45
Figure 15 Type for reporting status of the virtual nodes ....................................................................46
Figure 16 XML Schema used to report the status of the physical node...........................................46
Figure 17 Type for reporting complete description of the physical nodes.......................................46
Figure 18 Type for reporting the specific parameters of any node ...................................................47
Figure 19 Type for reporting information about the physical interface ...........................................48
Figure 20 Type for reporting information about a virtual machine..................................................48
Figure 21 Type for reporting information about the whole infrastructure ......................................49
Figure 22 Type for reporting information about the physical infrastructure...................................49
Figure 23 Type for reporting information about a physical link .......................................................50
Figure 24 Type for reporting information about the virtual infrastructure .....................................50
Figure 25 Type describing the service offered by the provider .........................................................51
Figure 26 Type describing the requirements that can be requested by a developer .......................52
Figure 27 Example of a typical Service description XML ..................................................................53
Figure 28 Example of a Request XML..................................................................................................53
Figure 29 Physical infrastructure description........................................................................................54
Figure 30 Virtual infrastructure description..........................................................................................55
Figure 31 Communication protocols employed in Nubilum..............................................................57
Figure 32 REST operation for the retrieval of service information..................................................59
Figure 33 REST operation for updating information of a service ....................................................59
Figure 34 REST operation for requesting resources for a new application.....................................59
Figure 35 REST operation for changing resources of a previous request .......................................60
Figure 36 REST operation for releasing resources of an application ...............................................60
Figure 37 REST operation for registering a new Worker...................................................................60
Figure 38 REST operation to unregister a Worker..............................................................................61
Figure 39 REST operation for update information of a Worker ......................................................61
Figure 40 REST operation for retrieving a description of the D-Cloud infrastructure .................61
Figure 41 REST operation for updating the description of a D-Cloud infrastructure...................61
Figure 42 REST operation for the creation of a virtual node............................................................62
Figure 43 REST operation for updating a virtual node ......................................................................62
Figure 44 REST operation for removal of a virtual node...................................................................62
Figure 45 REST operation for requesting the discovered physical topology ..................................63
Figure 46 REST operation for the creation of a virtual link ..............................................................63
Figure 47 REST operation for updating a virtual link.........................................................................64
Figure 48 REST operation for removal of a virtual link.....................................................................64
x
Figure 49 Example of a typical rule for ARP forwarding...................................................................65
Figure 50 Example of the typical rules created for virtual links: (a) direct, (b) reverse..................65
Figure 51 Example of a D-Cloud with ten workers and one Manager.............................................69
Figure 52 Algorithm for allocation of virtual nodes............................................................................74
Figure 53 Example illustrating the minimax path................................................................................75
Figure 54 Algorithm for allocation of virtual links..............................................................................76
Figure 55 The (a) old and (b) current network topologies of RNP used in simulations................77
Figure 56 Results for the maximum node stress in the (a) old and (b) current RNP topology....78
Figure 57 Results for the maximum link stress in the (a) old and (b) current RNP topology ......79
Figure 58 Results for the mean link stress in the (a) old and (b) current RNP topology...............80
Figure 59 Mean path length (a) old and (b) current RNP topology..................................................80
Figure 60 Example creating a virtual network: (a) before the creation; (b) after the creation ......81
Figure 61 Search procedure used by the GHS algorithm....................................................................83
Figure 62 Placement procedure used by the GHS algorithm.............................................................84
Figure 63 Example of the placement procedure: (a) before and (b) after placement.....................85
Figure 64 Percentage of optimal samples for GHS and STA in the old RNP topology................87
Figure 65 Percentage of samples reaching relative error ≤ 5% in the old RNP topology.............88
Figure 66 Percentage of optimal samples for GHS and STA in the current RNP topology ........88
Figure 67 Percentage of samples reaching relative error ≤ 5% in the current RNP topology......89
xi
List of Tables
Table I Summary of the main aspects discussed..................................................................................27
Table II MIMEtypes used in the overall communications.................................................................58
Table III Models for the length of messages exchanged in the system in bytes.............................67
Table IV Characteristics present in Nubilum’s resource model ........................................................71
Table V Reduced set of characteristics considered by the proposed allocation algorithms ..........72
Table VI Factors and levels used in the MPA’s evaluation ................................................................78
Table VII Factors and levels used in the GHS’s evaluation...............................................................86
Table VIII Scientific papers produced ..................................................................................................94
xii
Abbreviations and Acronyms
CDN Content Delivery Network
CloudML Cloud Modeling Language
D-Cloud Distributed Cloud
DHCP Dynamic Host Configuration Protocol
GHS Greedy Hub Selection
HTTP Hypertext Transfer Protocol
IaaS Infrastructure as a Service
ISP Internet Service Provider
LLDP Link Layer Discovery Protocol
MPA Minimax Path Algorithm
MPP Manager Positioning Problem
NaaS Network as a Service
NV Network Virtualization
OA Optimal Algorithm
OCCI Open Cloud Computing Interface
PoP Point of Presence
REST Representational State Transfer
RP Replica Placement
RPA Replica Placement Algorithm
STA Steiner Tree Approximation
VM Virtual Machine
VN Virtual Network
XML Extensible Markup Language
ZAA Zhu and Ammar Algorithm
1
1 Introduction
“A linea incipere.”
Erasmus
Nowadays, it is common to access content across the Internet with little reference to the underlying
datacenter hosting infrastructure maintained by content providers. The technology used to provide
such a level of locality transparency also enables a new model for the provisioning of computing
services, known as Cloud Computing. This model is attractive as it allows resources to be
provisioned according to users' requirements, leading to overall cost reduction. Cloud users can rent
resources as they become necessary, in a much more scalable and elastic way. Moreover, such users
can transfer operational risks to cloud providers. From the viewpoint of those providers, the model
offers a way to better utilize their own infrastructure. Armbrust et al. [1] point out that this
model benefits from a form of statistical multiplexing, since it allocates resources to several users
concurrently on a demand basis. This statistical multiplexing of datacenters builds on several
decades of research in many areas, such as distributed computing, Grid computing, web
technologies, service computing, and virtualization.
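The gain from statistical multiplexing can be illustrated with a small simulation (a hypothetical sketch, not taken from this thesis or from [1]): a shared pool sized for the peak of the aggregate demand needs far less capacity than the sum of the capacities each user would have to provision for their individual peaks.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

N_USERS = 100    # hypothetical number of cloud users
N_STEPS = 1000   # time steps in the simulated demand trace

# Each user's demand fluctuates independently between 0 and 10 units.
demands = [[random.uniform(0, 10) for _ in range(N_STEPS)]
           for _ in range(N_USERS)]

# Dedicated provisioning: each user is sized for their individual peak.
dedicated_capacity = sum(max(trace) for trace in demands)

# Multiplexed provisioning: one shared pool sized for the aggregate peak.
aggregate = [sum(demands[u][t] for u in range(N_USERS))
             for t in range(N_STEPS)]
shared_capacity = max(aggregate)

print(f"dedicated: {dedicated_capacity:.0f} units")
print(f"shared:    {shared_capacity:.0f} units")
print(f"savings:   {1 - shared_capacity / dedicated_capacity:.0%}")
```

Because the users' peaks rarely coincide, the shared pool comes out substantially smaller than the dedicated total; the exact savings depend on the assumed demand distribution.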
Current Cloud Computing providers mainly use large, consolidated datacenters to offer
their services. However, the ever-increasing need for over-provisioning to meet peak
demands and to provide redundancy against failures, combined with expensive cooling needs,
substantially increases the energy costs of centralized datacenters [62]. In current datacenters,
the cooling technologies used to control heat dissipation account for as much as 50% of the total
power consumption [38]. In addition, the network between users and the Cloud is often an
unreliable best-effort IP service, which can harm delay-constrained services and interactive
applications.
To deal with these problems, there are indications that small cooperative
datacenters can be more attractive, since they offer a cheaper, lower-power alternative
that reduces the infrastructure costs of centralized Clouds [12]. These small datacenters can be built
in different geographical regions and connected by dedicated or public (provided by Internet
Service Providers) networks, configuring a new type of Cloud, referred to as a Distributed Cloud. Such
2
Distributed Clouds [20], or just D-Clouds, can exploit the possibility of creating (virtual) links and
the potential of sharing resources across geographic boundaries to provide latency-based allocation
of resources and fully utilize this emerging distributed computing power. D-Clouds can reduce
communication costs by simply provisioning storage, servers, and network resources close to end-
users.
D-Clouds can be considered an additional step in the ongoing deployment of Cloud
Computing: one that supports different requirements and leverages new opportunities for service
providers. Users in a Distributed Cloud will be free to choose where to allocate their resources in
order to serve specific market niches, constraints on the jurisdiction of software and data, or quality
of service requirements of their clients.
1.1 Motivation
Similarly to Cloud Computing, one of the most important design aspects of D-Clouds is the
availability of “infinite” computing resources that may be used on demand. Cloud users see this
“infinite” resource pool because the Cloud continuously monitors and manages its resources and
allocates them in an elastic way. Nevertheless, providing on-demand computing instances and
network resources in a distributed scenario is not a trivial task. Dynamic allocation of resources
and their possible reallocation are essential characteristics for accommodating unpredictable
demands and, ultimately, contributing to the return on investment.
In the context of Clouds, the essential feature of any resource management system is to
guarantee that both user and provider requirements are met satisfactorily. Particularly in D-Clouds,
users may have network requirements, such as bandwidth and delay constraints, in addition to the
common computational requirements, such as CPU, memory, and storage. Furthermore, other user
requirements are relevant, including node locality, topology of nodes, jurisdiction, and application
interaction.
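As a rough illustration of the kinds of requirements just listed, the sketch below models a developer's request in Python. All class and field names here are hypothetical illustrations; they are not part of CloudML or any formalism defined in this thesis.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class NodeRequirement:
    """Computational requirements of one virtual node (illustrative only)."""
    cpu_cores: int
    memory_mb: int
    storage_gb: int
    location: Optional[str] = None  # desired geographic locality, if any

@dataclass
class LinkRequirement:
    """Network requirements between two virtual nodes."""
    src: str
    dst: str
    bandwidth_mbps: float
    max_delay_ms: Optional[float] = None

@dataclass
class ResourceRequest:
    """A developer's request: virtual nodes plus the links connecting them."""
    nodes: Dict[str, NodeRequirement]
    links: List[LinkRequirement]

# A hypothetical two-tier application with a locality constraint on one node.
request = ResourceRequest(
    nodes={
        "web": NodeRequirement(cpu_cores=2, memory_mb=2048,
                               storage_gb=20, location="Recife"),
        "db": NodeRequirement(cpu_cores=4, memory_mb=8192, storage_gb=100),
    },
    links=[LinkRequirement(src="web", dst="db",
                           bandwidth_mbps=100.0, max_delay_ms=10.0)],
)
print(f"{len(request.nodes)} nodes, {len(request.links)} link(s)")
```

The point of the sketch is the shape of the data: a D-Cloud request couples per-node computational constraints with per-link network constraints, which is precisely what datacenter-only allocators tend to ignore.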
The development of solutions to cope with resource management problems remains a very
important topic in the field of Cloud Computing. In this regard, there are solutions
focused on grid computing ([49], [70]) and on the datacenters of current Cloud Computing scenarios
([4]). However, such strategies do not fit D-Clouds well, as they are heavily based on assumptions
that do not hold in Distributed Cloud scenarios. For example, such solutions are designed for over-
provisioned networks and commonly do not take into consideration the cost of communication
between resources, an important aspect in D-Clouds that must be cautiously monitored
and/or reserved in order to meet users' requirements.
3
The design of a resource management system involves challenges beyond the specific
design of optimization algorithms for resource management. Since D-Clouds are composed of
computational and network devices with different architectures, software, and hardware capabilities,
the first challenge is the development of a suitable resource model covering all this heterogeneity
[20]. The next challenge is to describe how resources are offered, which is important
since the requirements supported by the D-Cloud provider are defined in this step. The remaining
challenges are related to the overall operation of the resource management system. When requests
arrive, the system should be aware of the current status of resources in order to determine whether
there are sufficient available resources in the D-Cloud to satisfy the present request. Hence,
the right mechanisms for resource discovery and monitoring should also be designed, allowing the
system to be aware of the updated status of all its resources. Then, based on the current status and
the requirements of the request, the system may select and allocate resources to serve the present
request.
Please note that the solution to those challenges involves the fine-grained coordination of
several distributed components and the orchestrated execution of the several subsystems composing
the resource management system. At first glance, these subsystems can be organized into three
parts: one responsible for the direct negotiation of requirements with users; another responsible for
deciding what resources to allocate for given applications; and one last part responsible for the
effective enforcement of these decisions on the resources.
Designing such a system is an interesting and challenging task, and it raises the following
research questions that will be investigated in this Thesis:
1. How do Cloud users describe their requirements? In order to enable automatic
negotiation between users and the D-Cloud, the Cloud must recognize a language or
formalism for requirements description. Thus, the investigation of this topic must determine
the proper characteristics of such a language and survey the existing approaches to this
topic in related computing areas.
2. How to represent the resources available in the Cloud? Related to the first question,
the resource management system must also maintain an information model to represent all
the resources in the Cloud, including their relationships (topology) and their current status.
3. How the users’ applications are mapped onto Cloud resources? This question is about
the very aspect of resource allocation, i.e., the algorithms, heuristics, and strategies that are
used to decide the set of resources meeting the applications’ requirements and optimizing a
utility function.
4. How to enforce the decisions made? The effective enforcement of the decisions involves
the extension of communication protocols, or the development of new ones, in order to
set up the state of the overall resources in the D-Cloud.
1.2 Objectives
The main objective of this Thesis is to propose an integrated solution to problems related to the
management of resources in D-Clouds. This solution is presented as Nubilum, a self-managed
resource management system that addresses the discovery, control, monitoring, and allocation of
resources in D-Clouds. Nubilum provides fine-grained orchestration of its components in order to
allocate applications on a D-Cloud.
The specific goals of this Thesis are strictly related to the research questions presented in
Section 1.1; they are:
• Elaborate an information model to describe D-Cloud resources and application
requirements, such as computational restrictions, topology, geographic location, and other
related aspects, that can be employed to request resources directly from the D-Cloud;
• Explore and extend communication protocols for the provisioning and allocation of
computational and communication resources;
• Develop algorithms, heuristics, and strategies to find suitable D-Cloud resources based on
several different application requirements;
• Integrate the information model, the algorithms, and the communication protocols, into a
single solution.
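As a first illustration of such an information model, the sketch below captures a few of the requirement types listed above as a simple data structure. The field names are illustrative assumptions only; they do not correspond to Nubilum's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ApplicationRequest:
    """Illustrative request a developer could submit to a D-Cloud."""
    cpu_cores: int                           # computational restrictions
    memory_mb: int
    storage_gb: int
    bandwidth_mbps: float                    # network requirement between nodes
    max_latency_ms: Optional[float] = None   # delay constraint to end-users
    region: Optional[str] = None             # geographic/jurisdiction restriction

# Example: a web application restricted to servers in Brazil
req = ApplicationRequest(cpu_cores=2, memory_mb=4096, storage_gb=20,
                         bandwidth_mbps=100.0, max_latency_ms=50.0,
                         region="BR")
```

A real information model would also need to express the topology of the requested nodes and their relationships, which a flat record like this cannot capture.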
1.3 Organization of the Thesis
This Thesis identifies the challenges involved in the resource management on Distributed Cloud
Computing and presents solutions for some of these challenges. The remainder of this document is
organized as follows.
The general concepts that form the basis for all the other chapters are introduced in the
second chapter. Its main objective is to discuss the definition of Cloud Computing and to classify
the main approaches in this area.
The Distributed Cloud Computing concept and several important aspects of resource
management on those scenarios are introduced in the third chapter. Moreover, this chapter will
make a comparative analysis of related research areas and problems.
The fourth chapter introduces the first contribution of this Thesis: the Nubilum resource
management system, which aggregates the several solutions proposed in this Thesis. Moreover, the
chapter highlights the rationale behind Nubilum as well as its main modules and components.
The fifth chapter examines and evaluates the control plane of Nubilum. It describes the
proposed Cloud Modeling Language and details the communication interfaces and protocols used
for communicating between Nubilum components.
The sixth chapter gives an overview of the resource allocation problems in Distributed
Clouds, and makes a thorough examination of the specific problems related to Nubilum. Some
particular problems are analyzed and a set of algorithms is presented and evaluated.
The seventh chapter of this Thesis reviews the obtained evaluation results, summarizes the
contributions, and sets the path for future work and open issues on D-Clouds.
2 Cloud Computing
“Definitio est declaratio essentiae rei.”
Legal Proverb
In this chapter the main concepts of Cloud Computing will be presented. It begins with a discussion
on the definition of Cloud Computing (Section 2.1) and the main agents involved in Cloud
Computing (Section 2.2). Next, classifications of Cloud initiatives are offered in Section 2.3. An
exemplary and simple architecture of a Cloud Mediation System is presented in Section 2.4 followed
by a presentation in Section 2.5 of the main technologies acting behind the scenes of Cloud
Computing initiatives.
2.1 What is Cloud Computing?
A definition of Cloud Computing is given by the National Institute of Standards and Technology
(NIST) of the United States: “Cloud computing is a model for enabling convenient, on-demand
network access to a shared pool of configurable computing resources (e.g., networks, servers,
storage, applications, and services) that can be rapidly provisioned and released with minimal
management effort or service provider interaction” [45]. The definition says that on-demand
dynamic reconfiguration (elasticity) is a key characteristic. Additionally, the definition highlights
another Cloud Computing characteristic: it assumes that minimal management efforts are required
to reconfigure resources. In other words, the Cloud must offer self-service solutions that must
attend to requests on-demand, excluding from the scope of Cloud Computing those initiatives that
operate through the rental of computing resources on a weekly or monthly basis. Hence, it restricts
Cloud Computing to systems that provide automatic mechanisms for resource rental in real-time
with minimal human intervention.
The NIST definition gives a satisfactory concept of Cloud Computing as a computing model.
However, it does not cover the main object of Cloud Computing: the Cloud. Thus, in this Thesis,
Cloud Computing is defined as the computing model that operates based on Clouds. In turn, the
Cloud is defined as a conceptual layer that operates above an infrastructure to provide elastic
services in a timely manner.
This definition encompasses three main characteristics of Clouds. Firstly, it notes that a Cloud
is primarily a concept, i.e., a Cloud is an abstraction over an infrastructure. Thus, it is independent of
the employed technologies and therefore one can accept different setups, like Amazon EC2 or
Google App Engine, to be named Clouds. Moreover, the infrastructure is defined in a broad sense,
since it can be composed of software, physical devices, and/or other Clouds. Secondly, all Clouds
have the same purpose: to provide services. This means that a Cloud hides the complexity of the
underlying infrastructure while exploring the potential of overlying services and acting as a
middleware. In addition, providing a service involves, implicitly, the use of some type of agreement
that should be guaranteed by the Cloud. Such agreements can vary from pre-defined contracts to
malleable agreements defining functional and non-functional requirements. Note that these services
are qualified as elastic ones, which has the same meaning of dynamic reconfiguration that appeared
in the NIST definition. Last but not least, the Cloud must provide services as quickly as possible
such that the infrastructure resources are allocated and reallocated to attend the users’ needs.
2.2 Agents involved in Cloud Computing
In contrast to previous approaches ([64], [8], [72], and [68]), this Thesis focuses on only three distinct
agents in Cloud Computing, as shown in Figure 1: clients, developers, and the provider. The first
notable point is that the provider deals with two types of users, called developers and clients.
Clients are the customers of a service produced by a developer; they use services from
developers, but such use generates demand to the provider that actually hosts the service, and
therefore the client can also be considered a user of the Cloud. It is important to highlight that in
some scenarios (like scientific computing or batch processing) a developer may behave as a client to
the Cloud because it is the end-user of the applications. The text will use “users” when referring to
both classes without distinctions.
Figure 1 Agents in a typical Cloud Computing scenario (from [24])
Developers can be service providers, independent programmers, scientific institutions, and so
on, i.e., all who build applications into the Cloud. They create and run their applications while
keeping decisions related to maintenance and management of the infrastructure to the provider.
Please note that, a priori, developers do not need to know about the technologies that make up the
Cloud infrastructure, nor about the specific location of each item in the infrastructure.
Lastly, the term application is used to mean all types of services that can be developed on the
Cloud. In addition, it is important to note that the type of applications supported by a Cloud
depends exclusively on the goals of the Cloud as determined by the provider. Such a wide range of
possible targets generates many different types of Cloud Providers that are discussed in the next
section.
2.3 Classification of Cloud Providers
Currently, there are several operational Cloud Computing initiatives; however, despite all being
called Clouds, they provide different types of services. For that reason, the academic community
([64], [8], [45], [72], and [71]) has classified these solutions in order to understand their
relationships. Three complementary classification proposals are presented below.
2.3.1 Classification according to the intended audience
This first simple taxonomy is suggested by NIST [45], which organizes providers according to the
audience at which the Cloud is aimed. There are four classes in this classification: Private Clouds,
Community Clouds, Public Clouds, and Hybrid Clouds.
The first three classes reflect a gradual broadening of the intended audience coverage. The
Private Cloud class encompasses those Clouds destined to be used solely by
an organization operating over their own datacenter or one leased from a third party for exclusive
use. When the Cloud infrastructure is shared by a group of organizations with similar interests it is
classified as a Community Cloud. Furthermore, the Public Cloud class encompasses all initiatives
intended to be used by the general public. Finally, Hybrid Clouds are simply the composition of two
or more Clouds pertaining to different classes (Private, Community, or Public).
2.3.2 Classification according to the service type
In [71], the authors offer a classification represented in Figure 2. This taxonomy divides Clouds into
five categories: Cloud Application, Cloud Software Environment, Cloud Software Infrastructure,
Software Kernel, and Firmware/Hardware. The authors arranged the different types of Clouds in a
stack, showing that Clouds from higher levels are created using services in the lower levels. This idea
pertains to the definitions of Cloud Computing discussed previously in Sections 2.1 and 2.2.
Essentially, the Cloud provider does not need to be the owner of the infrastructure.
Figure 2 Classification of Cloud types (from [71])
The class at the top of the stack, also called Software-as-a-Service (SaaS), involves applications
accessed through the Internet, including social networks, Webmail, and Office tools. Such services
provide software to be used by the general public, whose main interest is to avoid tasks related to
software management like installation and updating. From the point of view of the Cloud provider,
SaaS can decrease costs with software implementation when compared with traditional processes.
Similarly, the Cloud Software Environment, also called Platform-as-a-Service (PaaS), encloses
Clouds that offer programming environments for developers. Through well-defined APIs,
developers can use software modules for access control, authentication, distributed processing, and
so on, in order to produce their own applications in the Cloud. Moreover, developers can contract
services for automatic scalability of their software, databases, and storage services.
In the middle of the stack there is the Cloud Software Infrastructure class of initiatives. This
class encompasses solutions that provide virtual versions of infrastructure devices found in
datacenters like servers, databases, and links. Clouds in this class can be divided into three subclasses
according to the type of resource that is offered by them. Computational resources are grouped in
the Infrastructure-as-a-service (IaaS) subclass that provides generic virtual machines that can be used
in many different ways by the contracting developer. Services for massive data storage are grouped
in the Data-as-a-Service (DaaS) class, whose main mission is to store users' data remotely,
which allows those users to access their data from anywhere and at any time. Finally, the third
subclass, called Communications-as-a-Service (CaaS), is composed of solutions that offer virtual
private links and routers through telecommunication infrastructures.
The last two classes do not offer Cloud services specifically, but they are included in the
classification to show that providers offering Clouds in higher layers can have their own software
and hardware infrastructure. The Software Kernel class includes all of the software necessary to
provide services to the other categories like operating systems, hypervisors, cloud management
middleware, programming APIs, and libraries. Finally, the class of Firmware/Hardware covers all
sale and rental services of physical servers and communication hardware.
2.3.3 Classification according to programmability
The five-class scheme presented above can classify and organize the current spectrum of Cloud
Computing solutions, but such a model is limited because the number of classes and their
relationships will need to be rearranged as new Cloud services emerge. Therefore, in this Thesis, a
different classification model will be used based on the programmability concept, which was
previously introduced by Endo et al. [19].
Borrowed from the realm of network virtualization [11], programmability is a concept related
to the programming features a network element offers to developers, measuring how much freedom
the developer has to manipulate resources and/or devices. This concept can be easily applied to the
comparison of Cloud Computing solutions. More programmable Clouds offer environments where
developers are free to choose programming paradigms, languages, and platforms. Less
programmable Clouds restrict developers in some way: perhaps by forcing a set of programming
languages or by providing support for only one application paradigm. On the other hand,
programmability directly affects the way developers manage their leased resources. From this point
of view, providers of less programmable Clouds are responsible for managing their infrastructure
while remaining transparent to developers. In turn, a more programmable Cloud leaves more of these tasks to
developers, thus introducing management difficulties due to the more heterogeneous programming
environment.
Thus, Cloud Programmability can be defined as the degree of freedom that developers have to
manipulate services leased from a provider. Programmability is a relative
concept, i.e., it is used to compare one Cloud with another. Also, programmability is directly
proportional to heterogeneity in the infrastructure of the provider and inversely proportional to the
amount of effort that developers must spend to manage leased resources.
To illustrate how this concept can be used, one can classify two current Clouds: Amazon EC2
and Google App Engine. Clearly, Amazon EC2 is more programmable, since in this Cloud
developers can choose between different virtual machine classes, operating systems, and so on. After
they lease one of these virtual machines, developers can configure it to work as they see fit: as a web
server, as a content server, as a unit for batch processing, and so on. On the other hand, Google
App Engine can be classified as a less programmable solution, because it allows developers to create
Web applications that will be hosted by Google. This restricts developers to the Web paradigm and
to some programming languages.
2.4 Mediation System
Figure 3 introduces an Archetypal Cloud Mediation System. This is a conceptual model that will be
used as a reference to the discussion on Resource Management in this Thesis. The Archetypal Cloud
Mediation System focuses on one principle: resource management as the main service of any Cloud
Computing provider. Thus, other important services like authentication, accounting, and security are
out of the scope of this conceptual system and are therefore kept separate from the
Mediation System. Clients also do not factor into this
view of the system, since resource management is mainly related to the allocation of developers’
applications and meeting their requirements.
Figure 3 Components of an Archetypal Cloud Mediation System (adapted from [24])
The mediation system is responsible for the entire process of resource management in the
Cloud. Such a process covers tasks that range from the automatic negotiation of developers'
requirements to the execution of their applications. It has three main layers: negotiation, resource
management, and resource control.
The negotiation layer deals with the interface between the Cloud and developers. In the case
of Clouds selling infrastructure services, the interface can be a set of operations based on Web
Services for control of the leased virtual machines. Alternately, in the case of PaaS services, this
interface can be an API for software development in the Cloud. Moreover, the negotiation layer
handles the process of contract establishment between the enterprises and the Cloud. Currently, this
process is simple and the contracts tend to be restrictive. One can expect that in the future, Clouds
will offer more sophisticated avenues for user interaction through high level abstractions and service
level policies.
The resource management layer is responsible for allocating applications optimally in order
to obtain the maximum usage of resources. This function requires advanced strategies and heuristics
to allocate resources that meet the contractual requirements as established with the application
developer. These may include service quality restrictions, jurisdiction restrictions, elastic adaptation,
among others.
Metaphorically, one can say that while the resource management layer acts as the “brain” of
the Cloud, the resource control layer plays the role of its “limbs”. The resource control encompasses
all functions needed to enforce decisions generated by the upper layer. Beyond the tools used to
configure the Cloud resources effectively, all communication protocols used by the Cloud are
included in this layer.
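The three layers described above can be sketched as cooperating components. The classes, methods, and data below are purely illustrative assumptions (they do not reproduce any real mediation system); they only show how a request could flow from negotiation, through a placement decision, to enforcement:

```python
class NegotiationLayer:
    """Accepts developer requests and normalizes their requirements."""
    def submit(self, requirements: dict) -> dict:
        # A real system would validate contracts and negotiate SLAs here.
        return {"cpu": requirements.get("cpu", 1),
                "region": requirements.get("region")}

class ResourceManagementLayer:
    """Decides where to place the application (the 'brain')."""
    def __init__(self, servers):
        # servers: {server_id: {"cpu_free": int, "region": str}}
        self.servers = servers
    def place(self, req):
        candidates = [s for s, info in self.servers.items()
                      if info["cpu_free"] >= req["cpu"]
                      and (req["region"] is None or info["region"] == req["region"])]
        return min(candidates, default=None)  # trivial tie-break by name

class ResourceControlLayer:
    """Enforces the placement decision on the infrastructure (the 'limbs')."""
    def enforce(self, server_id, req):
        # A real system would drive hypervisors and network devices here.
        return f"allocated {req['cpu']} CPU(s) on {server_id}"

servers = {"sp-1": {"cpu_free": 4, "region": "BR"},
           "ny-1": {"cpu_free": 8, "region": "US"}}
negotiation = NegotiationLayer()
manager = ResourceManagementLayer(servers)
control = ResourceControlLayer()

req = negotiation.submit({"cpu": 2, "region": "BR"})
server = manager.place(req)
result = control.enforce(server, req)
```

The point of the sketch is the separation of concerns: the negotiation layer never sees servers, and the control layer never makes decisions.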
2.5 Groundwork Technologies
This section discusses some of the main technologies used by current Cloud mediation systems,
namely Service-Oriented Computing, Virtualization, MapReduce, and Datacenters.
2.5.1 Service-Oriented Computing
Service-Oriented Computing defines a set of principles, architectural models, and technologies for
the design and development of distributed applications. The recent development of software while
focusing on services gave rise to SOA (Service-Oriented Architecture), which can be defined as an
architectural model “that supports the discovery, message exchange, and integration between loosely
coupled services using industry standards” [37]. The common technology for the implementation of
SOA principles is the Web Service that defines a set of standards to implement services over the
World Wide Web.
In Cloud Computing, SOA is the main paradigm for the development of functions on the
several layers of the Cloud. Cloud providers publish APIs for their services on the web, allowing
developers to use the Cloud and to automate several tasks related to the management of their
applications. Such APIs can assume the form of WSDL documents or REST-based interfaces.
Furthermore, providers can make available Software Development Kits (SDKs) and other toolkits
for the manipulation of applications running on the Cloud.
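As a sketch of such a REST-based interface, the snippet below builds (without sending) an HTTP request that could provision a virtual machine. The endpoint path and JSON fields are hypothetical, since each provider documents its own API:

```python
import json
from urllib.request import Request

def build_provision_request(base_url: str, vm_spec: dict) -> Request:
    """Build (but do not send) a POST request to provision a VM."""
    body = json.dumps(vm_spec).encode("utf-8")
    return Request(base_url + "/v1/instances", data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")

req = build_provision_request("https://cloud.example.com",
                              {"image": "linux", "cpu": 2, "memory_mb": 2048})
```

An SDK typically wraps exactly this kind of request construction, plus authentication and response parsing, behind a language-native interface.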
2.5.2 Server Virtualization
Server virtualization is a technique that allows a computer system to be partitioned into multiple
isolated execution environments, called Virtual Machines (VMs), each offering a service similar to
that of a single physical computer. Each VM can be configured independently, having its
own operating system, applications, and network parameters. Commonly, such VMs are hosted on a
physical server running a hypervisor, the software that effectively virtualizes the server and manages
the VMs [54].
There are several hypervisor options that can be used for server virtualization. From the open-source
community, one can cite Citrix's Xen¹ and the Kernel-based Virtual Machine (KVM)². From the
realm of proprietary solutions, some examples are VMware ESX³ and Microsoft's Hyper-V⁴.
The main factor that boosted the adoption of server virtualization within Cloud
Computing is that such technology offers good flexibility regarding the dynamic reallocation of
workloads across servers. Such flexibility allows, for example, providers to execute maintenance on
servers without stopping developers’ applications (that are running on VMs) or to implement
strategies for better resource usage through the migration of VMs. Furthermore, server virtualization
is suited to the fast provisioning of new VMs through the use of templates, which enables
providers to offer elastic services to application developers [43].
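A toy illustration of this flexibility is the evacuation of a host before maintenance: every VM is migrated to another host with spare CPU capacity. The first-fit policy and the data layout below are assumptions made purely for the sketch:

```python
def evacuate(hosts: dict, maintenance_host: str):
    """Move every VM off `maintenance_host` to any host with spare capacity.

    hosts: {host: {"capacity": int, "vms": {vm_name: cpu_demand}}}
    Returns the migration plan as (vm, source, destination) tuples.
    """
    plan = []
    for vm, demand in dict(hosts[maintenance_host]["vms"]).items():
        for target, info in hosts.items():
            if target == maintenance_host:
                continue
            used = sum(info["vms"].values())
            if info["capacity"] - used >= demand:      # first fit
                info["vms"][vm] = demand
                del hosts[maintenance_host]["vms"][vm]
                plan.append((vm, maintenance_host, target))
                break
    return plan

hosts = {"h1": {"capacity": 8, "vms": {"vm-a": 2, "vm-b": 2}},
         "h2": {"capacity": 8, "vms": {"vm-c": 4}}}
plan = evacuate(hosts, "h1")   # both VMs fit on h2 (4 + 2 + 2 <= 8)
```

Production schedulers use far more sophisticated placement policies, but the essential enabler is the same: VMs, unlike processes bound to an OS installation, can be moved wholesale between hosts.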
2.5.3 MapReduce Framework
MapReduce [15] is a programming framework developed by Google for the distributed processing of
large data sets across computing infrastructures. Inspired by the map and reduce primitives present
in functional languages, its authors developed an entire framework for the automatic distribution of
computations. In this framework, developers are responsible for writing map and reduce operations
and for using them according to their needs, which is similar to the functional paradigm. These map
and reduce operations will be executed by the MapReduce system that transparently distributes
computations across the computing infrastructure and treats all issues related to node
communication, load balancing, and fault tolerance. For the distribution and synchronization of the
data required by the application, the MapReduce system also requires the use of a specially tailored
distributed file system called Google File System (GFS) [23].
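The classic word-count example shows the two operations a developer would write; the sequential driver below merely emulates the shuffling and grouping that the framework performs transparently across the infrastructure:

```python
from collections import defaultdict

def map_fn(document: str):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: sum all partial counts for one word."""
    return (word, sum(counts))

def run_mapreduce(documents):
    # Shuffle phase: group intermediate pairs by key
    # (in a real deployment this is done by the framework, in parallel).
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

result = run_mapreduce(["the cloud", "the distributed cloud"])
# result == {"the": 2, "cloud": 2, "distributed": 1}
```

The developer's code is the two small functions; node communication, load balancing, and fault tolerance stay inside the framework.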
Despite being introduced by Google, there are some open source implementations of the
MapReduce system, like Hadoop [6] and TPlatform [55]. The former is a popular open-source
software used for running applications on large clusters built of commodity hardware. This software
is used by large companies like Amazon, AOL, and IBM, as well as in different Web applications
such as Facebook, Twitter, Last.fm, among others. Basically, Hadoop is composed of two modules:
a MapReduce environment for distributed computing, and a distributed file system called the
Hadoop Distributed File System (HDFS). The latter is an academic initiative that provides a
1 http://www.xen.org/products/cloudxen.html
2 http://www.linux-kvm.org/page/Main_Page
3 http://www.vmware.com/
4 http://www.microsoft.com/hyper-v-server/en/us/default.aspx
development platform for Web mining applications. Similarly to Hadoop and Google’s MapReduce,
the TPlatform has a MapReduce module and a distributed file system known as the Tianwang File
System (TFS) [55].
MapReduce solutions are a common groundwork technology in PaaS Clouds because
they offer a versatile sandbox for developers. Unlike in IaaS Clouds, PaaS developers using a
general-purpose language with MapReduce support do not need to be concerned with software
configuration, software updating, and network configuration. All these tasks are the responsibility
of the Cloud provider, which, in turn, benefits from the fact that such configurations will be
standardized across the overall infrastructure.
2.5.4 Datacenters
Developers who are hosting their applications on a Cloud wish to scale their leased resources,
effectively increasing and decreasing their virtual infrastructure according to the demand of their
clients. This is also the case for developers making use of their own private Clouds. Thus,
independently of the class of Cloud under consideration, a robust and safe infrastructure is needed.
Whereas virtualization and MapReduce provide the software solution required to meet
this demand, the physical infrastructure of Clouds is based on datacenters, which are infrastructures
composed of IT components providing processing capacity, storage, and network services for one
or more organizations [66]. Currently, the size of a datacenter (in number of components) can vary
from tens to tens of thousands of components depending on the datacenter's
mission. In addition, there are several different IT components for datacenters, including switches
and routers, load balancers, storage devices, dedicated storage networks, and the main component
of any datacenter: servers [27].
Cloud Computing datacenters provide the power required to meet developers' demands in
terms of processing, storage, and networking capacities. A large datacenter, running a virtualization
solution, allows for a finer-grained division of the hardware's power through the statistical
multiplexing of developers' applications.
3 Distributed Cloud Computing
“Quae non prosunt singula, multa iuvant.”
Ovid
This chapter discusses the main concepts of Distributed Cloud (D-Cloud) Computing. It begins
with a discussion of its definition (Section 3.1) in an attempt to distinguish D-Clouds from
current Clouds and highlight their main characteristics. Next, the main research challenges regarding
resource management on D-Clouds will be described in Section 3.2.
3.1 Definitions
Current Cloud Computing setups involve huge investments in the datacenter, which is
the common underlying infrastructure of Clouds, as previously detailed in Section 2.5.4.
This centralized infrastructure brings many well-known challenges such as the need for resource
over-provisioning and the high cost for heat dissipation and temperature control. In addition to
concerns with infrastructure costs, one must observe that those datacenters are not necessarily close
to their clients, i.e., the network between end-users and the Cloud is often a long best-effort IP
connection, which means longer round-trip delays.
Considering such limitations, researchers from industry and academia have presented evidence that
small datacenters can sometimes be more attractive, since they offer a cheaper, low-power-consumption
alternative while also reducing the infrastructure costs of centralized Clouds [12].
Moreover, Distributed Clouds, or just D-Clouds, as pointed out by Endo et al. in [20], can exploit
the possibility of link creation and the potential of sharing resources across geographic boundaries
to provide latency-based allocation of resources and ultimately fully utilize this distributed computing
power. Thus, D-Clouds can reduce communication costs by simply provisioning data, servers, and
links close to end-users.
Figure 4 illustrates how D-Clouds can reduce the cost of communication through the spread
of computational power and the usage of a latency-based allocation of applications. In Figure 4(a)
the client uses an application (App) running on the Cloud through the Internet, which is subject to
the latency imposed by the best-effort network. In Figure 4(b), the client is accessing the same App,
but in this case, the latency imposed by the network will be reduced due to the allocation of the App
on a server in a small datacenter closer to the client than in the previous scenario.
Figure 4 Comparison between (a) a current Cloud and (b) a D-Cloud
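The latency-based allocation illustrated in Figure 4 can be sketched as picking, for each client, the candidate datacenter with the smallest measured round-trip time. The latency table below is fictitious:

```python
def nearest_site(client: str, latency_ms: dict) -> str:
    """Pick the candidate datacenter with the lowest RTT to the client."""
    return min(latency_ms[client], key=latency_ms[client].get)

# Measured round-trip times (ms) from clients to candidate small datacenters
latency_ms = {
    "client-recife": {"dc-recife": 8, "dc-saopaulo": 45, "dc-virginia": 140},
    "client-porto-alegre": {"dc-recife": 60, "dc-saopaulo": 18, "dc-virginia": 150},
}

site = nearest_site("client-recife", latency_ms)   # "dc-recife"
```

A real D-Cloud allocator would combine such latency measurements with capacity, cost, and jurisdiction constraints rather than minimizing latency alone.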
Please note that Figure 4(b) intentionally does not specify the network connecting the
infrastructure of the D-Cloud Provider. This network can be rented from different local ISPs (using
the Internet for interconnection) or from an ISP with wide area coverage. In addition, the ISP itself
could be the D-Cloud Provider. This may be the case as the D-Cloud paradigm
introduces an organic change in the current Internet, where ISPs can start to act as D-Cloud
providers. Thus, ISPs could offer their communication and computational resources to developers
interested in deploying their applications at the specific markets covered by those ISPs.
This idea is illustrated by Figure 5 that shows a D-Cloud offered by a hypothetical Brazilian
ISP. In this example, a developer deployed its application (App) on two servers in order to serve
requests from northern and southern clients. If the number of northeastern clients increases, the
developer can deploy its App (represented by the dotted box) on one server close to the northeast
region in order to improve its service quality. It is important to pay attention to the fact that the
contribution of this Thesis falls in this last scenario, i.e., a scenario where the network and
computational resources are all controlled by the same provider.
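The scale-out decision in this scenario can be sketched as a simple threshold rule: deploy a new replica in any region whose request rate exceeds a limit and that has no replica yet. The regions, rates, and threshold below are invented for illustration:

```python
def regions_needing_replica(request_rate, deployed, threshold=1000):
    """Regions whose demand exceeds `threshold` req/s but have no replica yet."""
    return sorted(r for r, rate in request_rate.items()
                  if rate > threshold and r not in deployed)

# Requests per second observed per region, and regions already serving the App
request_rate = {"north": 1500, "south": 1200, "northeast": 1800}
deployed = {"north", "south"}

new_sites = regions_needing_replica(request_rate, deployed)  # ["northeast"]
```

In the Figure 5 scenario this rule would trigger the deployment of the dotted-box replica close to the northeast region.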
Figure 5 ISP-based D-Cloud example
D-Clouds share similar characteristics with current Cloud Computing, including essential
offerings such as scalability, on demand usage, and pay-as-you-go business plans. Furthermore, the
agents already stated for current Clouds (please see Figure 1) are exactly the same in the context of
D-Clouds. Finally, the many different classifications discussed in Section 2.3 can also be applied.
Despite the similarity, one may highlight two peculiarities of D-Clouds: support to geo-locality and
Network as a Service (NaaS) provisioning ([2], [63], [17]).
The geographical diversity of resources potentially improves cost and performance and benefits
several different applications, particularly those that do not require massive internal
communication among large server pools. In this category, as pointed out by [12], one can
emphasize, firstly, applications already being deployed in a distributed manner, like VoIP (Voice
over IP) and online games; secondly, applications that are good candidates for
distributed implementation, like traffic filtering and e-mail distribution. In addition, there are
applications that use software or data with specific legal restrictions on
jurisdiction, and applications whose public is restricted to one or more geographical areas,
like the tracking of bus or subway routes, information about entertainment events, local news, etc.
Support for geo-locality can be considered a step further in the deployment of Cloud
Computing that leverages new opportunities for service providers. Thus, they will be free to choose
where to allocate their resources in order to serve specific niches, constraints on the jurisdiction of
software and data, or quality-of-service requirements of end-users.
NaaS (or Communication as a Service – CaaS, as cited in Section 2.3.2) allows service
providers to manage network resources, instead of just computational ones. The authors in [2] define
NaaS as a service offering transport network connectivity with a level of virtualization suitable to be
invoked by service providers. In this way, D-Clouds are able to manage their network resources
at their convenience, offering better response times for hosted applications. NaaS is
close to the Network Virtualization (NV) research area [31], where the main problem consists in
choosing how to allocate a virtual network over a physical one, meeting requirements and
minimizing the usage of physical resources. Although NV and D-Clouds are subject to similar
problems and scenarios, there is an essential difference between the two. While NV commonly
models its resources at the infrastructure level (requests are always virtual networks mapped onto
graphs), a D-Cloud can be engineered to work with applications at a different abstraction level,
exactly as occurs with actual Cloud service types like the ones described in Section 2.3.2. This way,
one may see Network Virtualization simply as a particular instance of the D-Cloud. Other insights
about NV are given in Section 3.3.2.
Finally, it must be highlighted that the D-Cloud does not compete with the current Cloud
Computing paradigm, since the D-Cloud fits a certain type of application with hard
restrictions on geographical location, while existent Clouds remain attractive for
applications demanding massive computational resources or simple applications with minor or no
restrictions on geographical location. Thus, current Cloud Computing providers are the first
potential candidates to take advantage of the D-Cloud paradigm, since they could hire
D-Cloud resources on demand and move applications to certain geographical locations in order
to meet specific developers’ requirements. In addition to current Clouds, D-Clouds can also
serve developers directly.
3.2 Research Challenges inherent to Resource Management
D-Clouds face challenges similar to the ones presented in the context of current Cloud Computing.
However, as stated in Chapter 1, the object of the present study is the resource management in D-
Clouds. Thus, this Section gives special emphasis to the challenges for resource management in D-
Clouds, while focusing on four categories as presented in [20]: a) resource modeling; b) resource
offering and treatment; c) resource discovery and monitoring; and d) resource selection.
3.2.1 Resource Modeling
The first challenge is the development of a suitable resource model, which is essential to all operations
in the D-Cloud, including management and control. Optimization algorithms are also strongly
dependent on the resource modeling scheme used.
In a D-Cloud environment, it is very important that resource modeling takes into account
physical resources as well as virtual ones. On one hand, the amount of detail in each resource
description should be treated carefully: if resources are described in great detail, resource
optimization risks becoming hard and complex, since optimization problems
considering the several modeled aspects can become NP-hard. On the other hand, more
details give more flexibility and leverage the usage of resources.
There are some alternatives for resource modeling in Clouds that could be applied to D-
Clouds. One can cite, for example, the OpenStack software project [53], which is focused on
producing an open-standard Cloud operating system. It defines a RESTful HTTP service that
supports JSON and XML data formats and is used to request or exchange information about
Cloud resources and action commands. OpenStack also offers ways to describe how to scale servers
up or down (using pre-configured thresholds); it is extensible, allowing the seamless addition of new
features; and it returns additional error messages in case of faults.
Another resource modeling alternative is the Virtual Resources and Interconnection Networks
Description Language (VXDL) [39], whose main goal is to describe the resources that compose a virtual
infrastructure, focusing on virtual grid applications. VXDL is able to describe the
components of an infrastructure, their topology, and an execution chronogram. These three aspects
compose the main parts of a VXDL document. The computational resource specification part
describes resource parameters. Furthermore, some peculiarities of virtual Grids are also present,
such as the allocation of virtual machines in the same hardware and location dependence. The
specification of the virtual infrastructure can consider specific developers’ requirements such as
network topology and delay, bandwidth, and the direction of links. The execution chronogram
specifies the period of resource utilization, allowing efficient scheduling, which is a clear concern for
Grids rather than Cloud Computing. Another interesting point of VXDL is the possibility of
describing resources individually or in groups, according to application needs. However, VXDL lacks
support for descriptions of distinct services, since it is focused on grid applications only.
The proposal presented in [32], called VRD hereafter, describes resources in a network
virtualization scenario where infrastructure providers describe their virtual resources and services
prior to offering them. It takes into consideration the integration between the properties of virtual
resources and their relationships. An interesting point in the proposal is its use of functional and
non-functional attributes. Functional attributes are related to characteristics, properties, and
functions of components. Non-functional attributes specify criteria and constraints, such as
performance, capacity, and QoS. Among the functional properties that must be highlighted is the set
of component types: PhysicalNode, VirtualNode, Link, and Interface. Such properties suggest a
flexibility that can be used to represent routers or servers, in the case of nodes, and wired or wireless
links, in the case of communication links and interfaces.
Another proposal, known as the Manifest language, was developed by Chapman et al. [9]. They
proposed new meta-models to represent service requirements, constraints, and elasticity rules for
software deployment in a Cloud. The building block of this framework is the OVF (Open
Virtualization Format) standard, which was extended by Chapman et al. to realize the vision of D-
Clouds considering locality constraints. These two points are very interesting to our scenario. With
regard to elasticity, the language assumes a rule-based specification formed by three fields: a monitored
condition related to the state of the service (such as workload), an operator (relational and logical
ones are accepted), and an associated action to follow when the condition is met. The location
constraints identify sites that should be favored or avoided when selecting a location for a service.
Nevertheless, the Manifest language is focused on the software architecture. Hence, the language is
not concerned with other aspects such as resources’ status or network resources.
Cloud# is a language for modeling Clouds proposed by [16] to be used as a basis for Cloud
providers and clients to establish trust. The model is used by developers to understand the behavior
of Cloud services. The main goal of Cloud# is to describe how services are delivered, while taking
into consideration the interaction among physical and virtual resources. The main syntactic
construct within Cloud# is the computation unit CUnit, which can model Cloud systems, virtual
machines, or operating systems. A CUnit is represented as a tuple of six components modeling
characteristics and behaviors. This language gives developers a better understanding of the Cloud
organization and how their applications are dealt with.
3.2.2 Resource Offering and Treatment
Once the D-Cloud resources are modeled, the next challenge is to describe how resources are
offered to developers, which is important since the requirements supported by the provider are
defined in this step. This challenge will also define the interfaces of the D-Cloud, and it
differs from resource modeling since the modeling is independent of the way resources are
offered to developers. For example, the provider could model each resource individually, as
independent items on a fine-grained scale such as GHz of CPU or GB of memory, but could offer
them as a coupled collection of those items or a bundle, such as the VM templates cited in Section
2.5.2.
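To make this distinction concrete, the sketch below models resources as fine-grained priced items but offers them as bundled VM templates. The catalog entries, template names, sizes, and prices are all hypothetical and purely illustrative:

```python
# Hypothetical per-unit prices for fine-grained resource items (per hour)
CATALOG = {"cpu_ghz": 0.05, "ram_gb": 0.02, "disk_gb": 0.001}

# Hypothetical bundles offered to developers as VM templates
TEMPLATES = {
    "small": {"cpu_ghz": 1.0, "ram_gb": 1.7, "disk_gb": 160},
    "large": {"cpu_ghz": 4.0, "ram_gb": 7.5, "disk_gb": 850},
}

def template_price(name):
    """Derive the price of a bundled offering from the fine-grained model."""
    return sum(CATALOG[item] * qty for item, qty in TEMPLATES[name].items())

round(template_price("small"), 3)  # → 0.244
```

The point is that the internal model (the catalog) stays fine-grained, while the offering exposed to developers is coarse-grained; the provider can change one without the other.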
Recall that, in addition to computational requirements (CPU and memory) and traditional
network requirements, such as bandwidth and delay, new requirements arise in D-Cloud
scenarios. The topology of the nodes is a first interesting requirement to be described. Developers
should be able to set inter-node relationships and communication restrictions (e.g., downlink and
uplink rates). This is illustrated in the scenario where servers – configured and managed by
developers – are distributed at different geographical localities while needing to
communicate with each other in a specific way.
Jurisdiction is related to where (geographically) applications and their data must be stored and
handled. Due to restrictions such as copyright laws, D-Cloud users may want to limit the locations
where their information will be stored (such as countries or continents). Another geographical
constraint can be imposed by a maximum (or minimum) physical distance (or delay value) between
nodes. Here, although developers do not know the actual topology of the nodes, they may
merely establish some delay threshold value, for example.
Developers should also be able to describe scalability rules, which specify how and
when the application grows and consumes more resources from the D-Cloud. The authors in [21]
and [9] define a way of doing this, allowing the Cloud user to specify actions that should be taken,
like deploying new VMs, based on thresholds of metrics monitored by the D-Cloud itself.
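Such a rule could be represented, for instance, as a (metric, operator, threshold, action) tuple. The sketch below is a hypothetical illustration of this rule format and its evaluation, not the actual syntax of [21] or [9]:

```python
import operator

# Map textual relational operators to Python comparison functions
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def evaluate_rules(metrics, rules):
    """Return the actions whose monitored condition is met.

    metrics: dict mapping metric names to current values
    rules:   list of (metric, op, threshold, action) tuples
    """
    actions = []
    for metric, op, threshold, action in rules:
        if metric in metrics and OPS[op](metrics[metric], threshold):
            actions.append(action)
    return actions

# Hypothetical rules: scale out above 80% load, scale in below 20%
rules = [("cpu_load", ">", 0.8, "deploy_vm"),
         ("cpu_load", "<", 0.2, "remove_vm")]
evaluate_rules({"cpu_load": 0.9}, rules)  # → ["deploy_vm"]
```

A real system would attach concrete provisioning operations to each action name and evaluate the rules against the metrics collected by the monitoring subsystem.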
Additionally, resource offering is associated with interoperability. Current Cloud providers offer
proprietary interfaces to access their services, which can lock users into their infrastructure, as
applications cannot be easily migrated between providers [8]. It is hoped that Cloud
providers recognize this problem and work together to offer a standardized API.
According to [61], Cloud interoperability faces two types of heterogeneities: vertical
heterogeneity and horizontal heterogeneity. The first type is concerned with interoperability within a
single Cloud and may be addressed by a common middleware throughout the entire infrastructure.
The second challenge, the horizontal heterogeneity, is related to Clouds from different providers.
Therefore, the key challenge is dealing with these differences. In this case, a high level of granularity
in the modeling may help to address the problem.
An important effort in the search for horizontal standardization comes from the Open Cloud
Manifesto5, which is an initiative supported by hundreds of companies that aims to discuss a way to
produce open standards for Cloud Computing. Its major doctrines are collaboration and
coordination of efforts on standardization, adoption of open standards wherever appropriate,
and the development of standards based on customer requirements. Participants of the Open Cloud
Manifesto, through the Cloud Computing Use Case group, produced an interesting white paper [51]
highlighting the requirements that need to be standardized in a Cloud environment to ensure
interoperability in the most typical scenarios of interaction in Cloud Computing.
5 http://www.opencloudmanifesto.org/
Another group involved with Cloud standards is the Open Grid Forum6, which develops the
specification of the Open Cloud Computing Interface (OCCI)7. The goal of OCCI is to
provide an easily extendable RESTful interface for Cloud management. Originally, OCCI was
designed for IaaS setups, but its current specification [46] was extended to offer a generic scheme
for the management of different Cloud services.
3.2.3 Resource Discovery and Monitoring
When requests reach a D-Cloud, the system should be aware of the current status of its resources in
order to determine whether there are available resources in the D-Cloud that could satisfy the requests.
Accordingly, the right mechanisms for resource discovery and monitoring should be designed,
allowing the system to be aware of the updated status of all its resources. Then, based on the current
status and the requests’ requirements, the system may select and allocate resources to serve these new
requests.
Resource monitoring should be continuous and support allocation and reallocation
decisions as part of the overall resource usage optimization. A careful analysis should be done to
find an acceptable trade-off between the amount of control overhead and the frequency of
resource information updates.
Monitoring may be passive or active. It is considered passive when one or more
entities collect information: such an entity may continuously send polling messages to nodes asking
for information or may do so on demand when necessary. On the other hand, monitoring is
active when nodes are autonomous and may decide when to asynchronously send state information
to some central entity. Naturally, D-Clouds may use both alternatives simultaneously to improve the
monitoring solution. In this case, it is necessary to synchronize updates in repositories to maintain
the consistency and validity of state information.
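The two alternatives can be sketched as follows; the node class, the threshold that triggers an active push, and the repository structure are hypothetical illustrations, not part of any cited system:

```python
import time

class MonitoredNode:
    """A node exposing its state for passive polling and active push."""

    def __init__(self, name, repository, threshold=0.8):
        self.name = name
        self.load = 0.0
        self.repository = repository  # shared state repository (central entity)
        self.threshold = threshold

    def poll(self):
        # Passive monitoring: a central entity asks the node for its state.
        return {"node": self.name, "load": self.load, "ts": time.time()}

    def set_load(self, load):
        self.load = load
        # Active monitoring: the node autonomously pushes its state when a
        # significant change occurs (here, crossing a load threshold).
        if load >= self.threshold:
            self.repository[self.name] = self.poll()

repo = {}
node = MonitoredNode("server-1", repo)
node.set_load(0.5)  # below threshold: nothing is pushed
node.set_load(0.9)  # above threshold: state is pushed to the repository
```

Combining both modes, as the text suggests, requires reconciling the polled and pushed records (e.g., by timestamp) so that the repository always reflects the most recent state.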
Discovery and monitoring in a D-Cloud can be accompanied by the development of
specific communication protocols. Such protocols act as a standard control plane in the Cloud,
allowing interoperability between devices. It is expected that such protocols can control the
different elements present in the D-Cloud, including servers, switches, routers, load balancers, and
storage components. One possible method of coping with this challenge is to use smart
communication nodes with an open programming interface for creating new services within the node.
One example of this type of open node can be seen in the emerging OpenFlow-enabled switches
[44].
6 http://www.gridforum.org/
7 http://occi-wg.org/about/specification/
3.2.4 Resource Selection and Optimization
With information regarding Cloud resource availability at hand, a set of appropriate candidates may
then be highlighted. Next, the resource selection process finds the configuration that fulfills all
requirements and optimizes the usage of the infrastructure. Selecting a solution from the set of
available ones is not a trivial task due to the dynamicity, the high algorithmic complexity, and all the
different requirements that must be contemplated by the provider.
The problem of resource allocation is recurrent in computer science, and several computing
areas have faced this type of problem since early operating systems. Particularly in the Cloud
Computing field, due to the heterogeneous and time-variant environment of Clouds, resource
allocation becomes a complex task, forcing the mediation system to respond with minimal
turnaround time in order to maintain the developer’s quality requirements. Balancing the load on
resources and designing energy-efficient Clouds are also major challenges in Cloud Computing.
This last aspect is especially relevant as a result of the high demand for electricity to power and
cool the servers hosted in datacenters [7].
In a Cloud, energy savings may be achieved through many different strategies. Server
consolidation, for example, is a useful strategy for minimizing energy consumption while
maintaining high usage of server resources. This strategy saves energy by migrating VMs onto
fewer servers and putting idle servers into a standby state. Developing automated solutions for
server consolidation can be a very complex task, since these solutions can be mapped to bin-packing
problems known to be NP-hard [72].
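As an illustration, a common heuristic for such bin-packing formulations is first-fit decreasing. The sketch below (with hypothetical VM load values) packs VM loads onto as few equal-capacity servers as possible, so that the remaining servers can be put on standby:

```python
def first_fit_decreasing(vm_loads, server_capacity):
    """First-fit decreasing heuristic for server consolidation.

    Packs VM loads onto the fewest servers of equal capacity; servers
    left unused can then be put into a standby state to save energy.
    Returns a list of servers, each a list of the VM loads placed on it.
    """
    servers = []
    for load in sorted(vm_loads, reverse=True):
        for server in servers:
            if sum(server) + load <= server_capacity:
                server.append(load)  # fits on an already open server
                break
        else:
            servers.append([load])   # open a new server
    return servers

# Six VMs consolidated onto servers of normalized capacity 1.0
first_fit_decreasing([0.5, 0.7, 0.3, 0.2, 0.4, 0.6], 1.0)
# → [[0.7, 0.3], [0.6, 0.4], [0.5, 0.2]]
```

First-fit decreasing is a classic approximation for bin packing; it does not guarantee the optimum, but it runs quickly and its packing uses at most roughly 11/9 of the optimal number of bins.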
VM migration and cloning provide a technology to balance load over servers within a Cloud,
provide fault tolerance against unpredictable errors, or reallocate applications before a programmed
service interruption. But, although this technology is present in major industry hypervisors (like
VMware or Xen), there remain some open problems to be investigated. These include cloning a
VM into multiple replicas on different hosts [40] and developing VM migration across wide-area
networks [14]. Also, VM migration introduces a network problem, since, after migration, VMs
require adaptation of the link-layer forwarding. Some of the strategies for new datacenter
architectures explained in [67] offer solutions to this problem.
The remodeling of datacenter architectures is another research field that tries to overcome
limitations on scalability, stiffness of address spaces, and node congestion in Clouds. The authors in [67]
surveyed this theme, highlighted the problems in the network topologies of state-of-the-art datacenters,
and discussed literature solutions for these problems. One of these solutions is the D-Cloud, as also
pointed out by [72], which offers an energy-efficient alternative for constructing a Cloud and an
adapted solution for time-critical services and interactive applications.
Considering specifically the challenges of resource allocation in D-Clouds, one can highlight
correlated studies on Replica Placement and Network Virtualization. The former is
applied to Content Distribution Networks (CDNs) and tries to decide where and when content
servers should be positioned in order to improve the system’s performance. This problem is associated
with the placement of applications in D-Clouds. The latter research field can be applied to D-Clouds
considering that a virtual network is an application composed of servers, databases, and the network
between them. Both research fields are described in the following sections.
Replica Placement
Replica Placement (RP) comprises a very broad class of problems. The main objective of this type
of problem is to decide where, when, and by whom servers or their content should be positioned in
order to improve CDN performance. The existing solutions to these problems are
generally known as Replica Placement Algorithms (RPAs) [35].
The general RP problem is modeled as a physical topology (represented by a graph), a set of
clients requesting services, and some servers to place on the graph (costs per server can be
considered instead). Generally, there is a pre-established cost function to be optimized that reflects
service-related aspects, such as the load of user’s requests, the distance from the server, etc. As
pointed out by [35], an RPA groups these aspects into two different components: the problem
definition, which consists of a cost function to be minimized under some constraints, and a
heuristic, which is used to search for near-optimal solutions in a feasible time frame, since the
defined problems are usually NP-complete.
Several different variants of this general problem have already been studied. According to [57],
they fall into two classes: facility location and minimum K-median. In the facility location problem,
the main goal is to minimize the total cost of the graph through the placement of a number of
servers, each having an associated cost. The minimum K-median problem, in turn, is similar but
assumes the existence of a pre-defined number K of servers. More details on the modeling and
comparison of different variants of the RP problem are provided by [35].
Different versions of this problem can be mapped onto resource allocation problems in D-
Clouds. A very simple mapping can be defined considering an IaaS service where virtual machines
are allocated on a geo-distributed infrastructure. In such a mapping, the topology corresponds to
the physical infrastructure elements of the D-Cloud, the VMs requested by developers are
treated as servers, and the number of clients accessing each server is their load.
Qiu et al. [57] proposed three different algorithms to solve the K-median problem in a CDN
scenario: a Tree-based algorithm, a Greedy algorithm, and a Hot Spot algorithm. The Tree-based
solution assumes that the underlying graph is a tree, which is divided into several small trees, placing
a server in each small tree. The Greedy algorithm places servers one at a time, choosing in each step
the placement that yields the best solution, until all servers are allocated. Finally, the Hot Spot
solution attempts to place servers in the vicinity of the clients with the greatest demand. The results
showed that the Greedy algorithm could provide CDNs with performance close to optimal.
These solutions can be mapped onto D-Clouds considering the simple scenario of VM
allocation on a geo-distributed infrastructure with the restriction that each developer has a fixed
number of servers to serve their clients. In such a case, this problem can be straightforwardly
reduced to the K-median problem and the three proposed solutions can be applied. Basically, one
could treat each developer as a different CDN and optimize each one independently, while still
considering the limited capacity of the physical resources caused by the allocations of other developers.
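A minimal sketch of the greedy strategy for the K-median placement described above might look as follows; the distance matrix and demand values are hypothetical, and the sketch ignores physical capacity limits for simplicity:

```python
def greedy_placement(distances, demands, k):
    """Greedy heuristic for the minimum K-median replica placement problem.

    distances: distances[i][j] is the distance between nodes i and j
    demands:   demands[j] is the request load of the clients at node j
    k:         number of replicas (servers) to place

    Places one replica at a time, each time choosing the node that most
    reduces the total demand-weighted distance to the closest replica.
    """
    n = len(distances)
    placed = []

    def total_cost(replicas):
        return sum(demands[j] * min(distances[i][j] for i in replicas)
                   for j in range(n))

    for _ in range(k):
        best = min((c for c in range(n) if c not in placed),
                   key=lambda c: total_cost(placed + [c]))
        placed.append(best)
    return placed

# Path topology 0–1–2–3 with heavy client demand at node 3 (hypothetical)
D = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
greedy_placement(D, demands=[1, 1, 1, 10], k=1)  # → [3]
```

With one replica, the heuristic places it at the high-demand node; adding replicas then fills in the nodes that best cover the remaining demand, without revisiting earlier choices.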
Presti et al. [56] treat an RP variant considering a trade-off between the load of requests per
content and the number of replica additions and removals. Their solution considers that each server
in the physical topology decides autonomously, based on thresholds, when to clone overloaded
content or remove underutilized content. Such decisions also encompass the minimization of
the distance between clients and the respective accessed replicas. A similar problem is investigated in
[50], but considering constraints on the QoS perceived by the client. The authors propose an offline
mathematical formulation and an online version that uses a greedy heuristic. The results
show that the heuristic achieves good results with little computational time.
The main focus of these solutions is to provide scalability to the CDN according to the load
caused by client requests. Thus, despite working only with the placement of content replicas, such
solutions can also be applied to D-Clouds with some simple modifications. Considering replicas as
allocated VMs, one can apply the threshold-based solution proposed in [56] to the simple scenario
of VM scalability on a geo-distributed infrastructure.
Network Virtualization
The main problem of NV is the allocation of virtual networks over a physical network [10], [3].
Analogously, the main goal in D-Clouds is to allocate application requests on physical resources according
to some constraints, while attempting to obtain a clever mapping between virtual and physical
resources. Therefore, problems in D-Clouds can be formulated as NV problems, especially in
scenarios considering IaaS-level services.
Several instances of the NV-based resource allocation problem can be reduced to NP-hard
problems [48]. Even the versions where one knows beforehand all the virtual network requests that
will arrive in the system are NP-hard. The basic solution strategy is thus to restrict the problem space,
making it easier to deal with, and also to consider the use of simple heuristic-based algorithms to
achieve fast results.
Given a model based on graphs to represent both physical and virtual servers, switches, and
links [10], an algorithm that allocates virtual networks should consider the constraints of the
problem (CPU, memory, location, or bandwidth limits) and an objective function based on the
algorithm’s goals. In [31], the authors describe some possible objective functions to be
optimized, like those related to maximizing the revenue of the service provider, minimizing link
and node stress, etc. They also survey heuristic techniques used when allocating virtual
networks, dividing them into two types: static and dynamic. The dynamic type permits reallocation
over time by adding more resources to already allocated virtual networks in order to obtain
better performance. The static type means that once a virtual network is allocated, it will hardly ever
change its setup.
To exemplify the type of problem studied in NV, one can discuss the one
studied by Chowdhury et al. [10]. The authors propose an objective function related to the cost and
revenue of the provider, constrained by capacity and geo-location restrictions. They reduce the
problem to a mixed integer program and then relax the integer constraints, deriving
two different algorithms for approximating the solution. Furthermore, the paper also
describes a Load Balancing algorithm, in which the original objective function is customized
in order to avoid using nodes and links with low residual capacity. This approach leads to
allocation on less loaded components and an increase in the revenue and acceptance ratio of
the substrate network.
This type of problem and solution can be applied to D-Clouds. One example could be the
allocation of interactive servers with jurisdiction restrictions. In this scenario, the provider must
allocate applications (which can be mapped onto virtual networks) whose nodes are linked and
must be close to a certain geographical place according to a maximum tolerated delay. Thus, a
provider could apply the proposed algorithms with minor adjustments.
In the paper by Razzaq and Rathore [58], the virtual network embedding algorithm is divided
into two steps: node mapping and link mapping. In the node mapping step, the nodes with the highest
resource demands are allocated first. The link mapping step is based on an edge-disjoint k-shortest
path algorithm, selecting the shortest path that can fulfill the virtual link bandwidth
requirement. In [42], a backtracking algorithm based on the graph isomorphism problem is proposed
for the allocation of virtual networks onto substrate networks. The modeling considers multiple
capacity constraints.
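A minimal sketch of such a two-step embedding is given below: greedy node mapping by decreasing CPU demand, followed by shortest-hop link mapping over links with enough residual bandwidth. The topology, the capacities, the BFS-based path search, and the one-virtual-node-per-physical-node restriction are illustrative simplifications, not the exact algorithm of [58]:

```python
from collections import deque

def embed_virtual_network(phys_cpu, phys_links, vn_cpu, vn_links):
    """Two-step virtual network embedding sketch (see assumptions above).

    phys_cpu:   dict physical node -> available CPU
    phys_links: dict (u, v) -> available bandwidth (undirected)
    vn_cpu:     dict virtual node -> CPU demand
    vn_links:   dict (a, b) -> bandwidth demand
    Returns (node mapping, link paths), or None if the request is rejected.
    """
    cpu = dict(phys_cpu)
    bw = {}
    for (u, v), c in phys_links.items():  # store links in both directions
        bw[(u, v)] = bw[(v, u)] = c

    # Step 1: node mapping, highest CPU demand first, at most one
    # virtual node per physical node (a simplifying restriction).
    mapping = {}
    for vnode, demand in sorted(vn_cpu.items(), key=lambda x: -x[1]):
        hosts = [n for n in cpu if cpu[n] >= demand and n not in mapping.values()]
        if not hosts:
            return None
        host = max(hosts, key=lambda n: cpu[n])  # most residual CPU
        mapping[vnode] = host
        cpu[host] -= demand

    # Step 2: link mapping via BFS (shortest hop count), restricted to
    # physical links whose residual bandwidth can fulfill the demand.
    paths = {}
    for (a, b), demand in vn_links.items():
        src, dst = mapping[a], mapping[b]
        prev, queue, seen = {}, deque([src]), {src}
        while queue and dst not in seen:
            u = queue.popleft()
            for (x, y), cap in bw.items():
                if x == u and y not in seen and cap >= demand:
                    seen.add(y)
                    prev[y] = u
                    queue.append(y)
        if dst not in seen:
            return None
        path = [dst]
        while path[-1] != src:
            path.append(prev[path[-1]])
        path.reverse()
        for u, v in zip(path, path[1:]):  # reserve bandwidth on the path
            bw[(u, v)] -= demand
            bw[(v, u)] -= demand
        paths[(a, b)] = path
    return mapping, paths
```

On a three-node physical path A–B–C with two virtual nodes connected by one virtual link, the sketch maps the virtual nodes onto distinct physical hosts and routes the virtual link over the single-hop physical path between them, decrementing the residual bandwidth accordingly.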
Zhu and Ammar [74] proposed a set of four algorithms with the goal of balancing the load on
the physical links and nodes, but their algorithms do not consider capacity aspects. Their algorithms
perform the initial allocation and make adaptive optimizations to obtain better allocations. The key
idea of the algorithms is to allocate virtual nodes considering the load of the node and the load of
the neighboring links of that node. Thus, one can say that they perform the allocation in a coordinated
way. For virtual link allocation, the algorithm tries to select paths with few stressed links in the
network. For more details about the algorithms, see [74].
Considering the objectives of NV and RP problems, one may note that NV problems are a
general form of the RP problem: RP problems try to allocate virtual servers, whereas NV considers
the allocation of virtual servers and virtual links. Both categories of problems can be applied to D-
Clouds. Particularly, RP and NV problems may be mapped onto two different classes of
D-Clouds: less controllable D-Clouds and more controllable ones, respectively. RP problems
are suitable for scenarios where the allocation of servers is more critical than that of links. In turn, NV
problems are especially suited to situations where the provider is an ISP with full control over
the whole infrastructure, including the communication infrastructure.
3.2.5 Summary
The D-Clouds’ domain brings several engineering and research challenges that were discussed in this
section and whose main aspects are summarized in Table I. Such challenges are only starting to
receive attention from the research community. Particularly, the system, models, languages, and
algorithms presented in the next chapters will cope with some of these challenges.
Table I Summary of the main aspects discussed

Resource Modeling:
  - Heterogeneity of resources
  - Physical and virtual resources must be considered
  - Complexity vs. flexibility

Resource Offering and Treatment:
  - Describe the resources offered to developers
  - Describe the supported requirements
  - New requirements: topology, jurisdiction, scalability

Resource Discovery and Monitoring:
  - Monitoring must be continuous
  - Control overhead vs. updated information

Resource Selection and Optimization:
  - Find resources to fulfill developers’ requirements
  - Optimize usage of the D-Cloud infrastructure
  - Complex problems solved by approximation algorithms
4 The Nubilum System
“Expulsa nube, serenus fit saepe dies.”
Popular Proverb
Section 2.4 introduced an Archetypal Cloud Mediation system, focusing specifically on the resource
management process, which ranges from the automatic negotiation of developers’ requirements to the
execution of their applications. Further, this system was divided into three layers: negotiation,
resource management, and resource control. Keeping in mind this simple archetypal mediation
system, this chapter presents Nubilum, a resource management system that offers a self-managed
solution to the challenges resulting from the discovery, monitoring, control, and allocation of resources
in D-Clouds. This system appeared previously in [25] under the name D-CRAS (Distributed Cloud
Resource Allocation System).
Section 4.1 presents some decisions taken to guide the overall design and implementation of
Nubilum. Section 4.2 presents a conceptual view of Nubilum’s architecture, highlighting its
main modules. The functional components of Nubilum are detailed in Section 4.3. Section 4.4
presents the main processes performed by Nubilum. Section 4.5 closes this chapter by summarizing
the contributions of the system and comparing it with correlated resource management systems.
4.1 Design Rationale
As stated previously in Section 1.2, the objective of this Thesis is to develop a self-manageable
system for resource management on D-Clouds. Before the development of the system and its
corresponding architecture, some design decisions that guide the development of the system
must be delineated and justified.
4.1.1 Programmability
The first aspect to be defined is the abstraction level at which Nubilum will act. Given that D-
Cloud concerns can be mapped onto previous approaches in the Replica Placement and
Network Virtualization research areas (see Section 3.2.4), a straightforward approach would be to
consider a D-Cloud working at the same abstraction level. Therefore, knowing that proposals in
both areas commonly work at the IaaS level, i.e., providing virtualized infrastructures,
Nubilum naturally also operates at the IaaS level.
Accordingly, Nubilum offers a Network Virtualization service: applications are treated as virtual
networks and the provider's infrastructure as the physical network. In this way, the allocation
problem becomes a virtual network assignment problem, and previous solutions from the NV area can be
applied. Note that this approach does not exclude previous Replica Placement solutions, since
that area can be viewed as a particular case of Network Virtualization.
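To make the virtual network assignment framing concrete, the sketch below shows one simple greedy heuristic: virtual nodes with CPU demands are placed on the physical nodes with the most residual capacity, largest demand first. This is only an illustrative toy, not Nubilum's actual allocation algorithm (which is developed later in the Thesis); all names and the capacity model are hypothetical.

```python
def assign_virtual_network(virtual_nodes, physical_nodes):
    """Greedily map each virtual node (name -> CPU demand) onto the
    physical node (name -> capacity) with the most residual capacity."""
    residual = dict(physical_nodes)  # node -> remaining free capacity
    mapping = {}
    # Place the largest demands first to reduce fragmentation.
    for vnode, demand in sorted(virtual_nodes.items(), key=lambda kv: -kv[1]):
        host = max(residual, key=residual.get)
        if residual[host] < demand:
            return None  # heuristic found no feasible assignment
        residual[host] -= demand
        mapping[vnode] = host
    return mapping

demands = {"web": 2, "db": 4, "cache": 1}
capacity = {"srvA": 5, "srvB": 4}
print(assign_virtual_network(demands, capacity))
```

A real assignment algorithm must also map virtual links onto physical paths with bandwidth constraints, which is what makes the general problem NP-hard and motivates the approximation algorithms discussed earlier.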
4.1.2 Self-optimization
As defined in Section 2.1, the Cloud must provide services in a timely manner, i.e., resources
required by users must be configured as quickly as possible. To meet this restriction,
Nubilum must operate as far as possible without human intervention, which is the very definition
of self-management from Autonomic Computing [69].
This operation involves the maintenance and adjustment of D-Cloud resources in the face of
changing application demands and of both innocent and malicious failures. Thus, Nubilum must provide
solutions covering the four aspects advocated by Autonomic Computing: self-configuration, self-
healing, self-optimization, and self-protection. In particular, this Thesis focuses on investigating self-
optimization – and, to some extent, self-configuration – in D-Clouds. The other two
aspects are considered out of the scope of this proposal.
According to [69], self-optimization involves letting a system's elements "continually seek
ways to improve their operation, identifying and seizing opportunities to make themselves more
efficient in performance or cost". This definition fits the aim of Nubilum very well: the system
must ensure automatic monitoring and control of resources to guarantee the optimal functioning of
the Cloud while meeting developers' requirements.
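The self-optimization cycle described above can be pictured as the classic autonomic monitor-analyze-plan-execute loop. The sketch below is a minimal toy in that spirit: it flags overloaded nodes and plans load shifts toward the least-loaded node. The thresholds, load model, and migration action are illustrative assumptions, not Nubilum's actual policy.

```python
def monitor(cloud):
    """Collect a snapshot of per-node load (0.0 to 1.0)."""
    return dict(cloud)

def analyze(metrics, high=0.8):
    """Flag nodes whose load exceeds the (assumed) high-water mark."""
    return [node for node, load in metrics.items() if load > high]

def plan(overloaded, metrics):
    """Pair each overloaded node with the least-loaded node as target."""
    target = min(metrics, key=metrics.get)
    return [(node, target) for node in overloaded if node != target]

def execute(cloud, actions, share=0.2):
    """Apply the planned load shifts (a stand-in for VM migration)."""
    for src, dst in actions:
        cloud[src] -= share
        cloud[dst] += share
    return cloud

cloud = {"n1": 0.9, "n2": 0.3, "n3": 0.5}
metrics = monitor(cloud)
actions = plan(analyze(metrics), metrics)
cloud = execute(cloud, actions)
```

Running such a loop continuously, without human intervention, is what turns resource monitoring and control into the self-optimizing behavior that Nubilum targets.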
4.1.3 Existing standards adoption
The Open Cloud Manifesto, an industry initiative that aims to discuss ways to produce open
standards for Cloud Computing, states that Cloud providers "must use and adopt existing standards
wherever appropriate" [51]. The Manifesto argues that the IT industry has already made several
efforts and investments in standardization, so it is more productive and economical to reuse such
standards when appropriate. Following this same line, Nubilum adopts industry standards
whenever possible. This adoption also extends to open processes and software tools.
4.2 Nubilum’s conceptual view
As shown in Figure 6, the conceptual view of Nubilum's architecture comprises three planes: a
Decision plane, a Management plane, and an Infrastructure plane. Starting from the bottom, the
lowest plane hosts all modules responsible for the appropriate virtualization of each resource in the