Quota enforcement for high-performance distributed storage systems
Kristal T. Pollack, Darrell D. E. Long, Richard A. Golding, Benjamin Reed, Ralph A. Becker-Szendy
IBM Almaden Research Center, San Jose, CA
Abstract

Storage systems manage quota to ensure that each user gets the storage they need, and that no one user can—even by accident—use up all available storage. This is difficult for large, distributed systems, especially those used for high-performance computing applications, because resource allocation occurs on many nodes concurrently. We present a scheme where quota is enforced asynchronously by intelligent storage servers: storage clients contact a shared management service to get vouchers, capability-like certificates that the clients can redeem at participating storage servers to allocate storage space. This approach produces low load on the shared management service, promotes good scaling, and allows the client to make decisions about which storage server(s) to use without communicating with the management service for further approval. Storage servers and the management service periodically reconcile voucher usage to ensure that clients do not cheat by spending the same voucher at multiple storage servers. We report on a simulation study that shows that this approach gives performance nearly as good as not enforcing quota at all, and that the load on the shared management server is remarkably low.

1 Introduction

Tracking and enforcing resource usage limits in a large distributed system is difficult because it requires maintaining a consistent view of total usage when consumption is occurring in several places concurrently. In a file system, for example, users must not use more than their storage quota. Many scientific applications involve tens of thousands of nodes all cooperating on a problem, all writing to shared files and consuming from the same pool of quota. The file system is typically built as a small cluster of metadata servers and a larger number of storage servers or disk arrays. We concentrate here on systems that use storage servers that provide intelligence similar to object storage, and so can track local storage allocation and enforce access control.

Existing quota systems trade off scalability and accuracy. A centralized quota tracking server can be byte-accurate as long as each client informs the server on each resource allocation, which inhibits scalable performance. Other systems use a centralized server but relax accuracy, either by tracking quota only in large granules or by using time-limited escrow mechanisms (which set aside a certain amount of resource for a client for a limited duration), both of which reduce the frequency with which a client must interact with the quota tracking server.

The existing quota systems also provide for a single quota-related policy for all clients. In a distributed file system like SanFS [10], for example, the centralized metadata server decides which logical disks in a storage pool a client should be allocated when it needs storage. This requires that the policy not only have an accurate record of how much quota each user has consumed, but also an accurate map of how much resource is available on every server on which that resource could be allocated.

We propose an alternative approach to tracking and enforcing resource limits. This approach borrows from microcash mechanisms: there is a centralized server that acts as a bank that issues vouchers to clients, which the clients can spend to allocate resources on whatever servers they want. A client can withdraw enough vouchers to cover its needs for some period, during which time it does not need to contact the bank. Servers are able to check the vouchers for authenticity. The vouchers are valid for a limited time, in order to handle clients that fail, and servers periodically reconcile their transactions with the bank to check that clients have behaved correctly.

This approach provides a different tradeoff than other quota servers. It provides excellent scalability—in most cases indistinguishable from not tracking resource usage at all—while providing byte-accurate but temporally-coarse accuracy similar to time-limited escrow. It reduces load on the centralized tracking service well below that of other mechanisms. It also decouples quota tracking from allocation policy, so that the quota server only needs to track how much quota a user has consumed, and does not need to be concerned with where that consumption has occurred, which reduces the load on the quota server and makes it easier to partition the quota tracking work across multiple servers. Further, each client can decide for itself where to allocate resources based on its own needs, which allows a client to customize its allocation based, for example, on how a particular file will be used.
[Figure 1: Basic distributed storage system architecture. A client talks to the file management cluster for authorization and directly to storage servers for access; each storage server manages its own disks.]

2 System context

Figure 1 shows the architecture of the distributed storage systems we are investigating. In these systems, clients act on behalf of users. The clients communicate with a file system management service cluster to locate files and authorize actions. The authorization includes both checking permission to access data and permission to consume resources. The authorization is expressed using location-independent capabilities [11] and vouchers, which encode the client's rights to access files and to allocate resources, respectively. Once the client has the capabilities and vouchers it needs, it communicates directly with storage servers to read and write data and to create and delete files. The storage server has the intelligence to manage internal resource allocation and to check capabilities for validity, similar to the object store model [8, 5].

We are investigating quota management as part of the K2 distributed storage system. For scalability reasons, K2 pushes decentralization as far as possible. Each node is an autonomous agent, acting in its own interest as much as possible while respecting community needs. We make this possible by each node acting as an enlightened rational agent, with algorithms that work to meet the node's needs while avoiding the "tragedy of the commons" [6], in which the limits on shared resources are not considered. In concrete terms for a storage system, this means that we want each client to be able to make its own allocation decisions that give the best result for the application the client is running (self-interest), while ensuring that users do not go over quota and that storage resources are not over-used (community interest).

Different files can have significantly different needs, and so the system allows a different layout for each file. One file may be located on a single storage server; another may be mirrored and striped across many storage servers. The client decides what the layout should be, based on expected needs derived either from application hints or inferred from file attributes. Peak file creation rates can be high in some scientific applications, and so it is important for good scalability to minimize the dependence on the shared file management service during file creation.

Scientific applications have characteristics different from what studies of end-user workstations have shown. The absolute numbers are several orders of magnitude larger: petabytes of data are being deployed now, with aggregate transfer rates of gigabytes per second, files in the terabytes, all being accessed by tens of thousands of clients. The clients are cooperating to run one application, and both read- and write-share files. The applications are bursty, as the clients synchronize as they move through phases of a computation and write out checkpoints concurrently or read the results of previous phases. Some of the files are only temporary, for communication between computation phases, while others are results of days of computation and must be carefully protected.

Because K2 is built for large distributed environments, its design works to minimize the trust required in any one component—in particular, the client. While many clients may be part of a single homogeneous compute cluster, some clients will be different and potentially not under careful administrative care—for example, user workstations used for visualizing results. The system assumes that clients authenticate the users that run on them, and that the clients can provide evidence of that authentication when communicating with the file management service and with storage servers [9]. Our design assumes that clients can crash-fail. While some clients can also be malicious, we do not focus on them; for example, we do not provide data access that can survive Byzantine behavior [1]. However, for resource quota management we do bound the effect that any client can have on the system.

3 Protocol operation

In this section we give an overview of the operation of the voucher-based quota system. Figure 2 shows the general flow of usage. A client first requests a voucher for storage resources from the quota server for the user, then sends IO requests to storage nodes, including the voucher when those requests may consume resources. If a client frees resources on a storage server, the storage server gives the client a "refund" voucher for the amount freed. The quota server and storage nodes periodically reconcile the set of vouchers that have been spent against those that have been issued, in order to detect clients that overuse a voucher.
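The flow just described can be sketched in miniature. The class and method names below are our own illustration, not the K2 API (the real protocol adds cryptographic protection, expiry, and reconciliation, discussed later in this section); the sketch shows only the request/spend/refund cycle and the bank's rule of never issuing vouchers beyond the remaining quota.

```python
# Toy sketch of the voucher flow: a quota server (the "bank") issues
# vouchers, and a storage node consumes them, refunding any balance.
# All names are hypothetical; vouchers here are plain dicts, not MACs.

class QuotaServer:
    def __init__(self, quota):
        self.quota = quota          # user's total quota (bytes)
        self.outstanding = 0        # vouchers issued but not yet reconciled
        self.reconciled_usage = 0   # usage confirmed at last reconciliation
        self.serial = 0

    def request(self, amount):
        """Grant a voucher, keeping reconciled usage plus outstanding
        vouchers at or below the quota."""
        grant = min(amount, self.quota - self.reconciled_usage - self.outstanding)
        self.outstanding += grant
        self.serial += 1
        return {"serial": self.serial, "amount": grant}

class StorageNode:
    def __init__(self):
        self.spent = {}             # serial -> bytes consumed on this node

    def write(self, voucher, nbytes):
        """Consume part of a voucher; return a 'refund' for the balance."""
        used = min(nbytes, voucher["amount"])
        self.spent[voucher["serial"]] = self.spent.get(voucher["serial"], 0) + used
        return {"serial": voucher["serial"], "amount": voucher["amount"] - used}

# One round trip: withdraw 2 MB of vouchers, write 1 MB, get 1 MB back.
bank = QuotaServer(quota=10 * 2**20)
node = StorageNode()
v = bank.request(2 * 2**20)
refund = node.write(v, 1 * 2**20)
assert refund["amount"] == 1 * 2**20
```

A dishonest client could try to spend `v` at several nodes; that is exactly what the periodic reconciliation between bank and storage nodes is there to detect.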
[Figure 2: Sequence of operations. A client begins by obtaining a voucher for a user from the quota server, then spending that voucher during IO requests to different storage nodes. Later, the quota server and storage nodes check that the client did not overuse a voucher. The actors are the file management service (bank), a client (for a user), and storage servers 1 and 2; the messages are request(userid, amount) and the voucher reply, then IO request(voucher) to each storage server, which allocates the resource and may return a leftover voucher, and finally a check-usage exchange between the bank and the storage servers.]

Vouchers. A voucher is a record of a decision to allow a client to consume resources on behalf of a particular user. It is represented as a cryptographically-protected sequence of bytes:

    {epoch, expiry, user, amount, serial}_auth

similar to capabilities used to authorize actions in Amoeba [11] and the T10 OSD [8]. User and amount are obvious. The voucher has a unique serial number, which is used when storage servers reconcile voucher usage with the management server. Each voucher also records when it was issued (the epoch) and when it expires.

The voucher includes a signature or MAC generated using a secret key known only to the management and storage servers, which ensures that a client cannot forge a voucher. However, it does not prevent one client from eavesdropping on another, and so vouchers must be transmitted only over private channels. Issues like avoiding replay attacks do not require special mechanisms in vouchers if they are used in conjunction with authorization capabilities that provide defense against replay.

Each voucher is valid for only a limited duration, as recorded in its expiry field. This is used in handling failure—if a client crashes while holding an unused voucher, other clients can use that quota once the voucher expires—and in reconciling storage and management servers. These are discussed further below.
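A minimal sketch of such a voucher, assuming an HMAC as the MAC and a fixed-width field encoding; the field layout and key handling are our own illustrative choices, since the paper specifies only the field list and that the key is shared by the management and storage servers.

```python
# Sketch of the voucher record {epoch, expiry, user, amount, serial}_auth.
# The struct layout and key are illustrative assumptions.
import hmac, hashlib, struct

SECRET = b"shared-by-management-and-storage-servers"   # never given to clients

def make_voucher(epoch, expiry, user, amount, serial):
    """Management server: pack the fields and append an HMAC over them."""
    body = struct.pack(">IQIQQ", epoch, expiry, user, amount, serial)
    mac = hmac.new(SECRET, body, hashlib.sha256).digest()
    return body + mac

def check_voucher(voucher):
    """Storage server: verify authenticity, then decode; None if forged."""
    body, mac = voucher[:-32], voucher[-32:]
    if not hmac.compare_digest(hmac.new(SECRET, body, hashlib.sha256).digest(), mac):
        return None
    epoch, expiry, user, amount, serial = struct.unpack(">IQIQQ", body)
    return {"epoch": epoch, "expiry": expiry, "user": user,
            "amount": amount, "serial": serial}

v = make_voucher(epoch=23, expiry=1_700_000_000, user=42, amount=2**21, serial=7)
assert check_voucher(v)["amount"] == 2**21
# Flipping one bit of the MAC makes the voucher invalid.
assert check_voucher(v[:-1] + bytes([v[-1] ^ 1])) is None
```

Note that, as the text says, the MAC prevents forgery but not replay by an eavesdropper; that is handled by the accompanying capabilities and by transmitting vouchers over private channels.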
Getting vouchers. While a client could ask the management server for quota authorization on every I/O, this would put an unreasonable load on the management server. Instead, the client maintains a pool of vouchers, and only periodically communicates with the management server. The client tries to maintain enough vouchers to cover any allocation it expects to do in the near future, while allowing for other clients to share quota. This approach reduces load on the management server and improves client response latency.

The client has to decide when to request vouchers and how much resource to ask for; the management server has to decide how much of that request to grant. The management server must maintain the invariant that the vouchers granted for a user to any client—which represent potentially-used resources—plus the amount actually allocated do not go over the user's quota.

While there are many possible policies for deciding when and how much to ask for, we focus on a client requesting quota from the management server on a regular schedule. This generally makes the load on the management server proportional to the number of clients, rather than to the intensity of workload on those clients. The client uses its history of recent resource consumption to estimate how much it will likely use between the current request and the next request, and asks the management server for the difference between the estimated usage and the vouchers it already has on hand.

If actual usage is higher than anticipated, then the client will have to ask the management server for extra vouchers before its next scheduled request. The client can estimate the management server's response time and the short-term voucher usage rate to predict when to send a request to the management server before the client runs out of vouchers.

The client may have extra vouchers when net consumption is lower than anticipated—perhaps because it has been freeing rather than allocating storage. In that case the client can return some vouchers to the management server, making the quota available for other clients. This matters, for example, when one client is cleaning up old files while other clients are writing new data.

The management server determines how much to grant to a client based on its global information, including the total amount of vouchers outstanding for a user and estimated demand from all clients consuming that user's quota. Granting more to a client can reduce the number of request messages that the management server must process, but giving too much to one client can inhibit sharing across multiple clients. The management server must also not issue so many vouchers that a user could go over quota.

One reasonable heuristic is for the management server to give each client an amount proportional to its consumption rate, while reserving some quota in case new clients begin using quota. The client policy discussed above makes requests approximately proportional to consumption rate, and so the management server can give each client the same fraction f of their requested amount. When there is plenty of quota, f = 1. As the number of clients increases or the amount of remaining quota decreases, each client gets a fraction f = r / ((n + 1) r̄) of their requests, where r is this client's consumption rate, n is the number of clients consuming from that quota, and r̄ is the average consumption rate over all active clients.
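The grant-fraction heuristic above is small enough to write out directly. The formula is the paper's; the function signature and the abstract "plenty of quota" test are our own illustration.

```python
# Grant fraction f for one client's voucher request.
# f = 1 when quota is plentiful; otherwise f = r / ((n + 1) * r_bar).

def grant_fraction(r, rates, plenty):
    """r: this client's consumption rate; rates: rates of all n active
    clients; plenty: whether remaining quota is ample (test left abstract,
    an assumption of this sketch)."""
    if plenty:
        return 1.0
    n = len(rates)
    r_bar = sum(rates) / n          # average rate over all active clients
    return r / ((n + 1) * r_bar)

# Three clients with equal rates: each gets 1/(n + 1) = 1/4 of its request,
# leaving the remaining (n + 1)th share in reserve for new clients.
assert grant_fraction(10.0, [10.0, 10.0, 10.0], plenty=False) == 0.25
```

Note how the (n + 1) in the denominator is what implements "reserving some quota in case new clients begin using quota": the fractions granted to the n clients sum to n/(n + 1), never the whole remainder.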
Using vouchers. Once a client has obtained a voucher, it can use the voucher to consume resources. The client picks which storage server it will use; the problem of selecting the server is outside the scope of this paper.

In the simplest way of using vouchers, the client sends its I/O request to the storage server, along with one or more vouchers that will cover any resource allocations the I/O request might require. The storage server keeps track of how much resource was actually consumed by the request, and may send the client a new voucher for any balance in its reply. The storage server keeps track of how much each user has consumed, plus any recently-spent vouchers. The vouchers are periodically reconciled with the management server in order to handle failure or to catch cheaters, as discussed below.

Consider a simple scenario: a client is trying to write 1 MB of data into an existing file. The client obtains a voucher for (say) 2 MB from the management server, then sends a write request to the storage server along with the voucher. The storage server determines how much resource is consumed. It might consume nothing, if the request only overwrites already-allocated blocks, or it might consume a full 1 MB, or something in between. The storage server will reply with a refund voucher for 2 MB minus the amount actually allocated.

Vouchers can also be used in a somewhat different way to solve a long-standing problem in object storage systems—ensuring that an operation will succeed when multiple clients could be consuming resources in the storage server. A client can use a voucher to reserve resources at the storage server to ensure that its later operations will have the resources to complete. This is particularly important for object stores because it is hard for a client to predict how much resource any one operation will consume.
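The reservation use of vouchers can be sketched as follows. The class and method names are hypothetical; the point is only that once a reservation is admitted, later writes against it cannot fail for lack of space, even with other clients consuming from the same server.

```python
# Sketch of voucher-backed reservations at a storage server (hypothetical
# API): admission control happens at reserve time, not at write time.

class ReservingNode:
    def __init__(self, capacity):
        self.capacity = capacity    # local space on this storage server
        self.reserved = 0           # space pinned by vouchers, not yet used
        self.used = 0               # space actually allocated

    def reserve(self, voucher_amount):
        """Pin space backed by a voucher; refuse if space is short."""
        if self.used + self.reserved + voucher_amount > self.capacity:
            return False
        self.reserved += voucher_amount
        return True

    def write(self, nbytes):
        """Consume previously reserved space; guaranteed to succeed for
        any nbytes up to the outstanding reservation."""
        assert nbytes <= self.reserved
        self.reserved -= nbytes
        self.used += nbytes

node = ReservingNode(capacity=4 * 2**20)
assert node.reserve(2 * 2**20)   # admission decided here, once
node.write(2**20)                # later operation cannot run out of space
```

This is useful precisely because, as the text notes, a client often cannot predict how much resource a single object-store operation will consume: it reserves a safe upper bound and lets the server refund the difference.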
Tracking and reconciling usage. The storage server keeps track of how much a user has consumed, and periodically reconciles voucher usage with the management server to catch cheaters and recover from failures.

[Figure 3: How the storage server maintains consumption information over multiple epochs. X is the number of epochs before vouchers expire; K is the number of epochs in the past when reconciliation happens. Reconciled epochs keep just totals; unreconciled epochs (23 through 27 on the timeline, up to the current epoch) keep totals and the vouchers spent in each epoch.]

The system divides time into epochs, as illustrated in Figure 3. Each voucher is associated with the epoch in which it was issued, and the storage server keeps a list of vouchers that it has received for each epoch. It also tracks how much resource was consumed against those vouchers in the epoch. At some point there can be no more activity associated with an epoch—because all vouchers from that epoch will have expired—and the storage server can reconcile the list of vouchers used for that epoch with the management server. After reconciliation, the storage server can get rid of the list of vouchers and merge that epoch's consumption information into the record of overall reconciled consumption.
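The per-epoch bookkeeping above can be sketched as a small ledger; the class is our own illustration, assuming the caller only reconciles an epoch once every voucher from it has expired.

```python
# Sketch of a storage server's per-epoch ledger: unreconciled epochs keep
# full per-voucher usage; reconciled epochs collapse to one running total.
from collections import defaultdict

class EpochLedger:
    def __init__(self):
        self.reconciled_total = 0        # usage merged from reconciled epochs
        self.epochs = defaultdict(dict)  # epoch -> {serial: bytes consumed}

    def consume(self, epoch, serial, nbytes):
        """Record consumption against a voucher in its issuing epoch."""
        usage = self.epochs[epoch]
        usage[serial] = usage.get(serial, 0) + nbytes

    def reconcile(self, epoch):
        """Report the epoch's voucher list to the management server, then
        drop it, keeping only the merged total."""
        report = self.epochs.pop(epoch, {})
        self.reconciled_total += sum(report.values())
        return report                    # what gets sent to the bank

ledger = EpochLedger()
ledger.consume(epoch=23, serial=7, nbytes=1024)
ledger.consume(epoch=23, serial=9, nbytes=512)
assert ledger.reconcile(23) == {7: 1024, 9: 512}
assert ledger.reconciled_total == 1536
```

The management server can compare the reported serial numbers across storage servers, which is how a voucher spent at multiple servers is eventually caught.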
We summarize the formal model for tracking and reconciling quota for a single user as follows. (The exposition for a single user is clearer than for multiple users, but the rules are the same.) The user has a quota Q, and the management server has authorized some allocation A; the system works to keep A ≤ Q. The allocation A is the amount of storage used on all storage servers plus the amount of any unredeemed vouchers: A = ∑_d S_d + U, where S_d is the amount used on storage server d and U is the amount of unredeemed vouchers.

For the management server to know A accurately using this definition, it would need to be involved synchronously on every resource allocation, which defeats the intent of the voucher approach. Instead, the management server uses a conservative estimate of A: instead of the actual current usage ∑_d S_d, it uses the usage determined at the last reconciliation, and instead of the actual unredeemed vouchers, it uses all vouchers issued since the last reconciliation. Formally, the management server knows the amount of resource the user had consumed as of the last reconciliation, which covered all epochs up to and including epoch e: ∑_d S_d^e, where e de-
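The accounting above can be illustrated with a toy calculation; the numbers are invented, and the function names are our own. The key property is that the conservative estimate can only over-count, never under-count, the true allocation A, so enforcing the estimate against Q is safe.

```python
# True allocation A = sum(S_d) + U, versus the management server's
# conservative estimate: reconciled usage plus all vouchers issued since
# the last reconciliation. A spent-but-unreconciled voucher is counted
# once in the estimate (as "issued"), so the estimate never under-counts.

def true_allocation(used_per_server, unredeemed):
    return sum(used_per_server) + unredeemed

def conservative_estimate(reconciled_usage, issued_since_reconcile):
    return reconciled_usage + issued_since_reconcile

Q = 100
S = [30, 20]   # per-server usage S_d; 10 of it is not yet reconciled
U = 5          # unredeemed vouchers
assert true_allocation(S, U) <= Q
# Reconciled usage 40, vouchers issued since then 15 (10 spent + 5 unspent):
assert conservative_estimate(40, 15) >= true_allocation(S, U)
```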