Flow-Based Differentiation in Datacenters: Can SDNs Help?
Adil Ahmad, Hashaam Mahboob, Mohammad Laghari & Ihsan Ayyub Qazi
Department of Computer Science
School of Science & Engineering
LUMS
Abstract:
Datacenter workloads show a mix of short and long flows. Short flows are latency-sensitive
whereas long flows are throughput-sensitive. This variation in requirements according to flow size
has brought about the need for differentiated services achieved through separate queues. In this
work, we argue that by using SDNs we can achieve better flow completion times while also
providing reasonably high and predictable throughput for long flows. We examine two popular
queueing principles: priority queueing and weighted-fair (or weighted round-robin) queueing. Our
goal is not to pit one against the other but merely to present the various possibilities. We discuss
the challenges faced when using multiple queues and present solutions that involve the use of
SDNs. Using OpenFlow controllers, we can adjust which queue a flow is placed in (for priority
queues) and dynamically update queue bandwidths (for weighted-fair queues). We present results
from experiments with both types of queues.
Introduction:
Almost all of the biggest tech companies, such as Microsoft, Google and Facebook, use datacenters
for their user-serving applications. These datacenters host a variety of services such as web
servers, database servers, MapReduce, Memcache, etc. These services generate a mix of mice
(short), medium and long flows. For these services, serving times as well as the quality of query
results are of prime importance. Moreover, it is well-known that short flows (<= 1 MB) are
latency-sensitive whereas long flows (> 1 MB) are generally throughput-sensitive [1]. Short flows
comprise web documents, small images, etc., whereas long flows include VM migrations,
Hadoop results, etc.
To cater to such a wide range of application/service demands, various schemes [2, 3, 4, 5, 6, 7]
have been proposed in the past. Some of these schemes [2, 6] are based on pure end-host
modifications. However, recent work [3, 7] has shown that pure end-host modifications result in
suboptimal performance for both short flows and long flows. The main reason is that queueing
delays contribute the majority of the latency experienced by flows. Therefore, most recent work
[5, 7] has focused on separating flows into queues based on principles such as Least Attained
Service (LAS) [8] or Shortest Remaining Processing Time (SRPT) [9]. The majority of this work has
focused on using priority queues to attain near-optimal flow completion times for short flows.
In the sphere of priority queueing, pFabric [3] and PIAS [5] are the most notable works. pFabric
uses SRPT-based scheduling and separates flows into a large number of priority queues.
Theoretically, it provides near-optimal performance. pFabric switches require very small buffers,
and rate control is minimal, relying on high packet loss to trigger backoff. However, pFabric is not
readily deployable in current datacenters since it requires a large number of queues and more
intelligent switches than are currently available.
PIAS [5] also uses a range of priority queues to emulate LAS in datacenter switches. Flows start at
the highest priority and are transferred to lower-priority queues as they exceed certain thresholds.
PIAS selects a range of thresholds which it uses to demote flows from higher-priority queues;
these thresholds are optimized across workloads and average datacenter loads. The authors of
PIAS have shown that it performs better than DCTCP [2] and provides performance comparable to
pFabric. PIAS, however, can be deployed on current switches with a few changes and does not
require as many queues as pFabric does.
MulBuff [4] is another scheme that employs two weighted-fair queues to provide flow
differentiation. It is the only scheme so far that has advocated the use of weighted-fair queues
over priority queues. Flows are divided based on their size, with short flows occupying one queue
and long flows occupying the other. MulBuff attempts to dynamically allocate bandwidth based on
the number of flows in each queue. Given the bursty nature of datacenter traffic [10], such bursts
are expected to inherently favor short flows under MulBuff. The authors argue that a dynamic
bandwidth allocation model can mitigate the problems of bandwidth wastage and starvation.
In this work, we study priority queues and weighted-fair queues from the standpoint of
SDN-enabled datacenter switches. In section 2, we motivate the problems faced when using
either type of queue and show how SDNs can help solve them. In section 3, we discuss various
design principles that we feel could be used in the future to design schemes based on flow
differentiation, along with the pros and cons of each design. In section 4, we present experiments
that we performed with priority and weighted-fair queues using some of the designs proposed in
section 3, and provide some analysis. In section 5, we present a few ideas for the future. In
section 6, we conclude with a brief summary of the overall work.
Motivation:
In this section, we motivate the usefulness of SDN-based solutions for developing datacenter
transport strategies by illustrating various challenges and presenting solutions that involve SDNs.
In datacenter networks, switches are usually not aware of flow sizes, which is why most
state-of-the-art transport strategies employ either SRPT or LAS to estimate flow sizes and provide
improved performance. We believe that SDN statistics collection, such as that of OpenFlow, can be
used to estimate flow sizes and provide differentiated services accordingly. If each flow entry
corresponds to a single flow, we can estimate both the current size of each flow and the total
number of short and long flows, and update our flows or queues accordingly. However, various
limitations of commodity switches prevent us from doing so.
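Before turning to those limitations, the following minimal sketch illustrates what such statistics-driven estimation could look like with the POX controller used later in our evaluation. It is a sketch under assumptions, not a complete scheme: the 1 MB short-flow threshold and the 5-second polling interval are illustrative choices.

    # Sketch: estimating flow sizes and counts from OpenFlow statistics.
    # Assumptions: one flow entry per flow; 1 MB threshold; 5 s polling.
    from pox.core import core
    import pox.openflow.libopenflow_01 as of
    from pox.lib.recoco import Timer

    log = core.getLogger()
    SHORT_FLOW_LIMIT = 1 * 1024 * 1024  # flows at or below 1 MB count as short

    def _poll_stats():
        # Ask every connected switch for its per-flow counters.
        for connection in core.openflow.connections:
            connection.send(of.ofp_stats_request(body=of.ofp_flow_stats_request()))

    def _handle_flow_stats(event):
        # Each stats entry carries match fields plus byte/packet counters,
        # so with one entry per flow we can size and count flows directly.
        short = sum(1 for f in event.stats if f.byte_count <= SHORT_FLOW_LIMIT)
        log.info("flows: %d short, %d long", short, len(event.stats) - short)

    def launch():
        core.openflow.addListenerByName("FlowStatsReceived", _handle_flow_stats)
        Timer(5, _poll_stats, recurring=True)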
Switches used in datacenters do not have very high specifications, and they pose certain
challenges for researchers, including the following:
1. Switch CPUs are relatively low-end.
2. Fast hardware ASICs are employed but have a limited number of virtual interfaces or queues.
3. TCAM space for storing flow entries is limited.
We dissect the three problems below and present viable solutions.
Figure 1: Depicting the use of SDNs in switches
In traditional datacenters, with TCP being the most widely used [1] transport protocol, there is
little need for computation speed at the switch, and most commonly used switches have very little
processing power. Performing any sort of computation at regular intervals is therefore a challenge.
With SDN-enabled switches, we can have controllers running on separate nodes and providing
computations for the switch to use. An OpenFlow [11] controller can leverage the rich statistical
data collected at the switches and provide commands for the switch to execute.
Another limitation comes in the form of the limited number of queues per port (typically 8) [3].
This major limitation has halted the deployment of near-optimal schemes such as pFabric and is
part of the design rationale behind PIAS and PASE. However, using an OpenFlow controller, we
can mitigate the performance deterioration induced by this limitation. Instead of the large number
of queues that pFabric demands, we could have a limited number of queues and either (a) move
flows from one queue to another, or (b) dynamically update the bandwidth of the various queues,
as sketched below.
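For concreteness, the sketch below shows one way a controller node could realize option (b) on an Open vSwitch: it shells out to ovs-vsctl to create two HTB queues on a port and to adjust a queue's rate later. The port name, the 100 Mbps link rate and the even initial split are illustrative assumptions, not recommendations.

    # Sketch: creating two rate-limited queues on an OVS port and
    # updating a queue's bandwidth later. Port name and rates are
    # illustrative assumptions.
    import subprocess

    def create_queues(port="s1-eth1", link_bps=100 * 10**6):
        # One QoS object with two HTB queues, initially split evenly:
        # queue 0 for short flows, queue 1 for long flows (assumed roles).
        subprocess.check_call([
            "ovs-vsctl", "set", "Port", port, "qos=@newqos", "--",
            "--id=@newqos", "create", "QoS", "type=linux-htb",
            "other-config:max-rate=%d" % link_bps, "queues:0=@q0,1=@q1", "--",
            "--id=@q0", "create", "Queue",
            "other-config:max-rate=%d" % (link_bps // 2), "--",
            "--id=@q1", "create", "Queue",
            "other-config:max-rate=%d" % (link_bps // 2)])

    def set_queue_rate(queue_uuid, max_bps):
        # Option (b): dynamically update one queue's bandwidth cap.
        subprocess.check_call(["ovs-vsctl", "set", "Queue", queue_uuid,
                               "other-config:max-rate=%d" % max_bps])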
TCAMs are fast-access memory blocks that are used to store flow entries. An incoming packet is
matched against these entries at the switch and forwarded accordingly. In OpenFlow switches,
these flow entries store rich statistical data which can be used to process flows; however, the main
problem arises when we try to store and process a large number of flow entries. TCAM space is
limited, and datacenter operators usually aggregate flow entries to work around this limitation.
However, fine-grained flow entry installation would result in greatly improved flow scheduling. In
the next section, we present a few designs based on OpenFlow controllers to counter this
limitation.
To confirm our motivation, we performed a simple experiment comparing single-queue and
multiple-queue links. The multiple-queue system is a priority queueing system with dynamic
demotion of flows from higher to lower queues, as they exceed a threshold, through an OpenFlow
[11] controller. Figure 2 confirms that a multiple-queue system can deliver considerable benefits
when dealing with datacenter workloads.
Figure 2: Average flow completion times (AFCTs) of single-queue versus multiple-queue switch links
Design Considerations:
In this section, we discuss the various designs that we believe can be employed to counter the
challenges faced when using either priority queues or weighted-fair queues. Throughout this
discussion, when we mention priority queues, we mean strict-priority scheduling queues:
whenever packets are available at the highest-priority queue, the switch always serves that queue
with the full available bandwidth, and only serves the other queue when no packets are enqueued
at the higher-priority one. Our weighted-fair queues are simple weighted round-robin queues
where the bandwidth is divided on the basis of the number of flows in each queue.
We start with some comments on priority and weighted-fair queues.
Priority queues provide optimal short-flow latency but, at the same time, result in degraded
performance for long flows. Long flows require consistent throughput, and it is difficult to see
how any priority queueing scheme can provide such a guarantee to long flows, which invariably
get serviced in lower-priority queues.
Weighted-fair queues, on the other hand, present a balance between a single queue and multiple
priority queues. Network operators can service flows according to their requirements. However, to
implement a pure weighted-fair queueing scheme as proposed by MulBuff [4], the switch would
require an accurate estimate of the number of flows in each queue, and a network controller
would have to adjust the bandwidth of each of the given queues.
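The helper below sketches the proportional split we assume throughout: each queue receives bandwidth in proportion to its flow count, which is the MulBuff-style policy described above. The even fallback when no flows are present is our own assumption.

    # Sketch: dividing link bandwidth between two weighted-fair queues
    # in proportion to the number of flows in each queue.
    def queue_rates(link_bps, n_short, n_long):
        total = n_short + n_long
        if total == 0:
            return link_bps // 2, link_bps // 2  # no flows: even split (assumed)
        short_rate = link_bps * n_short // total
        return short_rate, link_bps - short_rate

    # Example: 10 short and 2 long flows on a 100 Mbps link yields
    # roughly an 83/17 Mbps split.
    print(queue_rates(100 * 10**6, 10, 2))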
As can be inferred from the discussion in the previous sections, we can gain certain benefits by
using SDNs, irrespective of the queueing scheme being employed. We now discuss two main
design ideas for flow differentiation in datacenters and how they can be adapted for priority and
weighted-fair queues: the reactive approach and the proactive approach.
One thing to keep in mind is that these ideas have nothing to do with the type of queues
involved; rather, they are ways to solve the issue of limited TCAM space. Both, however, involve
the use of SDN controllers. These approaches were proposed by Fernandez [16].
Reactive Approach:
In this approach, the first packet of each flow is routed to the controller, which can then install a
flow-specific entry for each flow. This allows more fine-grained control, with the controller having
an accurate (or nearly accurate) estimate of the number of flows and of flow sizes, derived from
OpenFlow statistics. With a priority queueing scheme, we can then accurately move a flow from
one queue to the other when it exceeds a certain limit; with weighted-fair queues, we obtain a
more accurate estimate of the number of flows in each queue for accurate bandwidth division.
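A minimal POX sketch of the reactive design follows: the switch has no matching entry for a new flow, so its first packet reaches the controller, which installs an exact-match entry placing the flow in the high-priority queue. The output port, queue number and idle timeout are illustrative assumptions.

    # Sketch of the reactive approach: install one exact-match entry
    # per flow on its first packet. OUT_PORT, the queue id and the
    # timeout are illustrative assumptions.
    from pox.core import core
    import pox.openflow.libopenflow_01 as of

    OUT_PORT = 1         # assumed port towards the client
    HIGH_PRIO_QUEUE = 0  # assumed high-priority queue

    def _handle_packet_in(event):
        msg = of.ofp_flow_mod()
        # Exact match on this flow's headers: one flow entry per flow.
        msg.match = of.ofp_match.from_packet(event.parsed, event.port)
        msg.idle_timeout = 10  # let stale per-flow entries expire
        msg.actions.append(of.ofp_action_enqueue(port=OUT_PORT,
                                                 queue_id=HIGH_PRIO_QUEUE))
        msg.data = event.ofp  # also forward the buffered first packet
        event.connection.send(msg)

    def launch():
        core.openflow.addListenerByName("PacketIn", _handle_packet_in)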
However, there are a few tradeoffs to consider when using this approach. Firstly, the cost of flow
setup is high, and any scheme should be able to justify this cost. Secondly, the controller needs to
keep track of flow entries that are going stale and should have an appropriate policy of either (a)
aggregating previous flow entries or (b) deleting previous flow entries. Finally, switches are not
able to handle more than a few hundred OpenFlow messages per second, so communication has
to be optimized.
Proactive Approach:
In this approach, the controller pre-populates the list of flow entries based on some metric that
the network operator provides and performs flow scheduling on those entries. This approach is
similar to the one currently employed in datacenters. The main difference is that the controller
still has visibility, through flow statistics, over the current state of affairs at the switch, though this
visibility is rather limited. Another benefit that cannot be ignored is that there is no flow setup
time, in contrast to the reactive approach.
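A corresponding proactive sketch pre-installs one coarse, wildcarded entry per traffic class when a switch connects, so no packet ever takes a round trip to the controller. The destination-port class map and the output port below stand in for the operator-provided metric and are purely illustrative assumptions.

    # Sketch of the proactive approach: wildcarded per-class entries
    # are pre-installed at connection time. The port-based class map
    # and OUT_PORT are illustrative assumptions.
    from pox.core import core
    import pox.openflow.libopenflow_01 as of

    OUT_PORT = 1
    CLASS_TO_QUEUE = {80: 0, 5001: 1}  # assumed TCP dst port -> queue

    def _handle_connection_up(event):
        for tp_dst, queue_id in CLASS_TO_QUEUE.items():
            msg = of.ofp_flow_mod()
            # Coarse match: all IPv4/TCP traffic to this destination port.
            msg.match = of.ofp_match(dl_type=0x0800, nw_proto=6, tp_dst=tp_dst)
            msg.actions.append(of.ofp_action_enqueue(port=OUT_PORT,
                                                     queue_id=queue_id))
            event.connection.send(msg)

    def launch():
        core.openflow.addListenerByName("ConnectionUp", _handle_connection_up)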
However, the tradeoffs still remain: our visibility is limited and our estimates are inaccurate. This
can, at times, lead to worse performance, especially under priority queueing schemes. Suppose a
short flow gets mapped to a lower-priority queue; with an inaccurate estimate, the probability of
this happening is actually quite high. As far as weighted-fair queues are concerned, their
performance can also suffer, since bandwidth would be divided inaccurately.
Hence, both approaches have their benefits and disadvantages. In the next section we compare
the two approaches and examine the performance of both designs under varying load conditions.
Evaluation:
Experimental Setup:
All of our experiments were conducted using Mininet [12], an open-source tool for emulating
network topologies on a single machine. We used the Linux Traffic Control (TC) [13] module to
create various queues on an Open vSwitch [14]. The controller we used for the experimentation
was POX [15], a Python-based controller optimized for OpenFlow version 1.0. All experiments
were carried out using OpenFlow standard 1.0.
Our testbed consisted of a topology of 10 servers connected to 1 client; Figure 3 shows our test
topology. The client requested data from the servers based on the web-search distribution that
Alizadeh et al. [2] found to be prevalent in various datacenter networks. We compared the
reactive and proactive techniques with both priority and weighted-fair queues. Each experiment
was run 5 times and the average taken for analysis. In all our experiments, we detect and demote
long flows by moving flows that exceed a certain limit (1 MB per flow) to the other queue.
Figure 3: Topology used for Mininet Experiments
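The demotion step just described can be sketched in POX as below: the controller periodically polls per-flow byte counts and rewrites the entry of any flow past the 1 MB limit so that its subsequent packets are enqueued in the long-flow queue. The output port, queue number and 2-second polling period are illustrative assumptions.

    # Sketch of long-flow demotion: poll per-flow byte counts and move
    # flows past 1 MB to the long-flow queue. OUT_PORT, the queue id
    # and the polling period are illustrative assumptions.
    from pox.core import core
    import pox.openflow.libopenflow_01 as of
    from pox.lib.recoco import Timer

    OUT_PORT = 1
    LONG_FLOW_QUEUE = 1
    LIMIT = 1 * 1024 * 1024  # 1 MB per flow

    def _poll():
        for connection in core.openflow.connections:
            connection.send(of.ofp_stats_request(body=of.ofp_flow_stats_request()))

    def _handle_flow_stats(event):
        for entry in event.stats:
            if entry.byte_count > LIMIT:
                # Rewrite this flow's entry to use the long-flow queue.
                msg = of.ofp_flow_mod(command=of.OFPFC_MODIFY)
                msg.match = entry.match
                msg.actions.append(of.ofp_action_enqueue(
                    port=OUT_PORT, queue_id=LONG_FLOW_QUEUE))
                event.connection.send(msg)

    def launch():
        core.openflow.addListenerByName("FlowStatsReceived", _handle_flow_stats)
        Timer(2, _poll, recurring=True)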
Experimental Results:
Priority Queues:
When dealing with priority queues, we observed the following trends. Presented below are the
graphs for the average, short and long flow completion times.
Figure 4: Average Flow Completion Times for priority queues
As Figure 4 shows, there is very little difference between the average flow completion times when
using priority queues. This is to be expected, since the overall queueing principle is the same and
the underlying technique employed (reactive or proactive) only affects the detection of short and
long flows. The completion times of short flows and long flows individually, however, are likely to
differ.
Figure 5: Long Flow Completion Times, Priority Queues
Figure 5 shows the completion times of long flows. Here we can see the main difference between
the reactive and proactive approaches when it comes to OpenFlow. As can be inferred from the
graph, the proactive approach results in much better completion times for long flows. The reason
is that, with aggregated flow entries, our estimate is inaccurate: as soon as we move an entry to
the highest-priority queue, all flows matching that entry get serviced in the higher-priority queues,
so a number of long flows are served alongside short flows. Under the reactive approach, by
contrast, long-flow detection is much more accurate, so long flows are sent to the lower-priority
queue much earlier, which results in much longer completion times for them.
Figure 6: Short Flow Completion Times, Priority Queues
Figure 6 shows the results for short flow completion times. It shows that the reactive approach
decreases short flow completion times. Once again, this is because the reactive approach is able
to more accurately detect and partition short and long flows. However, since there is a flow setup
cost, the tradeoff induces some penalty and the results remain quite close for both scenarios. The
average flow completion times remain close to each other because our workloads generate mostly
short flows, which compensate for the latency incurred by long flows.
Future Work:
As our experimentation has revealed, there are tradeoffs to keep in mind when using the proactive
or reactive approach with OpenFlow-based, SDN-enabled switches. It is up to the network
operator to decide which queueing system is more appropriate for their needs.
Whenever a differentiated services system is adopted, certain things need to be kept in mind. The
first is fairness. A case can be made, especially in enterprise datacenters, that it would be wrong
to provide different levels of service to customers paying equally. However, under the
pay-as-you-use model employed these days, certain customers could be incentivised with better
performance if they pay more for it. Ultimately, it is up to the network operator to decide whether
such a model is beneficial.
Another thing to note is that it is easy to game such systems. In the case of priority queues, an
adversarial node could simply divide its larger flows into smaller portions to make sure it enjoys
service in the higher-priority queue. Moreover, the SDN controller runs on a machine which, if
compromised, can wreak havoc on the system. Therefore, there needs to be a security mechanism
in place to defend against such attacks.
Conclusion:
In this work, we propose that using SDNs to achieve flow differentiation in datacenters is an
option worth considering. We have shown that flow differentiation is beneficial as far as flow
completion times are concerned. However, there are various challenges in implementing flow
differentiation on current switches, and we present various ways in which SDNs can help achieve
that goal. We also show how OpenFlow, as a popular SDN model, can be used to achieve flow
differentiation. We experiment with the two different approaches, reactive and proactive, and
show that there are various tradeoffs to consider. In the end, we discuss some further challenges
to keep in mind while deploying such a system.
References:
1. Abts, D., & Felderman, B. (2012). A guided tour of datacenter networking. Communications of the ACM, 55(6), 44-51.
2. Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., ... & Sridharan, M. (2011). Data center TCP (DCTCP). ACM SIGCOMM Computer Communication Review, 41(4), 63-74.
3. Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., & Shenker, S. (2013). pFabric: Minimal near-optimal datacenter transport. ACM SIGCOMM Computer Communication Review, 43(4), 435-446.
4. Mushtaq, A., Ismail, A. K., Wasay, A., Mahmood, B., Qazi, I. A., & Uzmi, Z. A. (2014, August). Rethinking buffer management in data center networks. In ACM SIGCOMM Computer Communication Review (Vol. 44, No. 4, pp. 575-576). ACM.
5. Bai, W., Chen, L., Chen, K., Han, D., Tian, C., & Wang, H. (2015). Information-agnostic flow scheduling for commodity data centers. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15) (pp. 455-468).
6. Munir, A., Qazi, I. A., Uzmi, Z. A., Mushtaq, A., Ismail, S. N., Iqbal, M. S., & Khan, B. (2013, April). Minimizing flow completion times in data centers. In INFOCOM, 2013 Proceedings IEEE (pp. 2157-2165). IEEE.
7. Munir, A., Baig, G., Irteza, S. M., Qazi, I. A., Liu, A. X., & Dogar, F. R. (2014, August). Friends, not foes: Synthesizing existing transport strategies for data center networks. In ACM SIGCOMM Computer Communication Review (Vol. 44, No. 4, pp. 491-502). ACM.
8. Auchmann, M., & Urvoy-Keller, G. (2008). On the variance of the least attained service policy and its use in multiple bottleneck networks. In Network Control and Optimization (pp. 70-77). Springer Berlin Heidelberg.
9. Aalto, S., & Ayesta, U. (2009). SRPT applied to bandwidth-sharing networks. Annals of Operations Research, 170(1), 3-19.
10. Benson, T., Akella, A., & Maltz, D. A. (2010, November). Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (pp. 267-280). ACM.
11. McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., ... & Turner, J. (2008). OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review, 38(2), 69-74.
12. Lantz, B., Heller, B., & McKeown, N. (2010, October). A network in a laptop: Rapid prototyping for software-defined networks. In Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks (p. 19). ACM.
13. Almesberger, W. (1999). Linux network traffic control: Implementation overview. In 5th Annual Linux Expo (No. LCA-CONF-1999-012, pp. 153-164).
14. Pfaff, B., Pettit, J., Amidon, K., Casado, M., Koponen, T., & Shenker, S. (2009, October). Extending networking into the virtualization layer. In HotNets.
15. The POX Controller. https://github.com/noxrepo/pox
16. Fernandez, M. P. (2013, March). Comparing OpenFlow controller paradigms scalability: Reactive and proactive. In Advanced Information Networking and Applications (AINA), 2013 IEEE 27th International Conference on (pp. 1009-1016). IEEE.