Performance Analysis of TLS Web Servers
Cristian Coarfa, Peter Druschel and Dan S. Wallach
Department of Computer Science
Rice University
Abstract

TLS is the protocol of choice for securing today's e-commerce and online transactions, but adding TLS to a web server imposes a significant overhead relative to an insecure web server on the same platform. We perform a comprehensive study of the performance costs of TLS. Our methodology is to profile TLS web servers with trace-driven workloads, replacing individual components inside TLS with no-ops, and measuring the observed increase in server throughput. We estimate the relative costs of each component within TLS, predicting the areas for which future optimizations would be worthwhile. Our results show that RSA accelerators are effective for e-commerce site workloads, because they experience low TLS session reuse. Accelerators appear to be less effective for sites where all the requests are handled by a TLS server, thus having a higher session reuse rate; investing in a faster CPU might prove more effective.

1. Introduction

Secure communication is an intrinsic demand of today's world of online transactions. The most widely used method is SSL/TLS [10]. Originally designed at Netscape for its web browsers and servers, Netscape's Secure Socket Layer (SSL) has been standardized by the IETF and is now called Transport Layer Security (TLS). TLS runs at the transport layer above existing protocols like TCP. TLS is used in a variety of applications, including secure web servers, secure shell and secure mail servers. As TLS is most commonly used for secure web applications, such as online banking and e-commerce, our goal is to provide a comprehensive performance analysis of TLS web servers. While previous attempts to understand TLS performance have focused on specific processing stages, such as the RSA operations or the session cache, we analyze TLS web servers as systems, measuring page-serving throughput under trace-driven workloads.

TLS provides a flexible architecture that supports a number of different public key ciphers, bulk encryption ciphers, and message integrity functions. In its most common web usage, TLS uses 1024-bit RSA encryption to transmit a secret that serves to initialize a 128-bit RC4 stream cipher and uses MD5 as a keyed hash function. (Details of these algorithms can be found in Schneier [25] and most other introductory cryptography texts.)

TLS web servers incur a significant performance penalty relative to a regular web server running on the same platform (as little as a factor of 3.4 to as much as a factor of 9, in our own experiments). As a result of this cost, a number of hardware accelerators are offered by vendors such as nCipher, Broadcom, Alteon and Compaq's Atalla division. These accelerators take the modular exponentiation operations of RSA and perform them in custom hardware, thus freeing the CPU for other tasks.

Researchers have also studied algorithms and systems to accelerate RSA operations. Boneh and Shacham [8] have designed a software system to perform RSA operations together in batches, at a lower cost than doing the operations individually. Dean et al. [9] have designed a network service, offloading the RSA computations from web servers to dedicated servers with RSA hardware. A more global approach was to distribute the TLS processing stages among multiple machines. Mraz [16] has designed an architecture for high volume TLS Internet servers that offloads the RSA processing and bulk ciphering to dedicated servers.

The TLS designers knew that RSA was expensive and that web browsers tend to reconnect many times to the same web server. To address this, they added a cache, allowing subsequent connections to resume an earlier TLS session and thus reuse the result of an earlier RSA computation. Research has suggested that, indeed, session caching helps web server performance [11].

Likewise, there has been considerable prior work in performance analysis and benchmarking of conventional web servers [15, 12, 17, 5, 18], performance optimizations of web servers, performance-oriented web server design, and operating system support for web servers [13, 22, 6, 7, 21]. Apostolopoulos et al. [3] studied the cost of TLS connection setup, RC4 and MD5, and proposed TLS connection setup protocol changes.
Our methodology is to replace each individual operation within TLS with a "no-op" and measure the incremental improvement in server throughput. This methodology measures the upper bound that may be achieved by optimizing each operation within TLS, whether through hardware or software acceleration techniques. We can measure the upper bound on a wide variety of possible optimizations, including radical changes like reducing the number of TLS protocol messages. Creating such an optimized protocol and proving it to be secure would be a significant effort, whereas our simulations let us rapidly measure an upper bound on the achievable performance benefit. If the benefit were minimal, we would then see no need for designing such a protocol.

Section 2 presents an overview of the TLS protocol. Section 3 explains how we performed our experiments and what we measured. Section 4 analyzes our measurements in detail. Our paper wraps up with future work and conclusions.

2. TLS protocol overview

The TLS protocol, which encompasses everything from authentication and key management to encryption and integrity checking, fundamentally has two phases of operation: connection setup and steady-state communication.

Connection setup is quite complex. Readers looking for complete details are encouraged to read the RFC [10]. The setup protocol must, among other things, be strong against active attackers trying to corrupt the initial negotiation where the two sides agree on key material. Likewise, it must prevent "replay attacks" where an adversary who recorded a previous communication (perhaps one indicating some money is to be transferred) could play it back without the server's realizing the transaction is no longer fresh (and thus, allowing the attacker to empty out the victim's bank account).

TLS connection setup has the following steps (quoting from the RFC):

- Exchange hello messages to agree on algorithms, exchange random values, and check for session resumption.

- Exchange certificates and cryptographic information to allow the client and server to authenticate themselves. [In our experiments, we do not use client certificates.]

- Exchange the necessary cryptographic parameters to allow the client and server to agree on a "premaster secret".

- Generate a "master secret" from the premaster secret chosen by the client and exchanged random values.

- Allow the client and server to verify that their peer has calculated the same security parameters and that the handshake occurred without tampering by an attacker.

There are several important points here. First, the TLS protocol designers were aware that performing the full setup protocol is quite expensive, requiring two network round-trips (four messages) as well as expensive cryptographic operations, such as the 1024-bit modular exponentiation required of RSA. For this reason, the premaster secret can be stored by both sides in a session cache. When a client subsequently reconnects, it need only present a session identifier. Then, the premaster secret (known to client and server but not to any eavesdropper) can be used to create a new master secret, a connection-specific value from which the connection's encryption keys, message authentication keys, and initialization vectors are derived.
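For concreteness, the following Python sketch models the server side of this caching behavior; the class and method names are illustrative only and do not correspond to mod_SSL's actual shared-memory session cache.

```python
import os
import time

class ToySessionCache:
    """Illustrative model of a TLS server-side session cache.

    A full handshake stores the premaster secret under a fresh session ID;
    a resuming client presents that ID and, on a hit, the server can derive
    new connection keys without repeating the RSA operation.
    """

    def __init__(self, lifetime_seconds=300):
        self._entries = {}            # session_id -> (premaster_secret, created_at)
        self._lifetime = lifetime_seconds

    def store(self, premaster_secret):
        session_id = os.urandom(32)   # returned to the client during the handshake
        self._entries[session_id] = (premaster_secret, time.time())
        return session_id

    def lookup(self, session_id):
        """Return the cached premaster secret, or None to force a full handshake."""
        entry = self._entries.get(session_id)
        if entry is None or time.time() - entry[1] > self._lifetime:
            self._entries.pop(session_id, None)
            return None
        return entry[0]
```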
After the setup protocol is completed, the data exchange phase begins. Prior to transmission, the data is broken into packets. For each packet, the packet is optionally compressed, a keyed message authentication code is computed and added to the message with its sequence number. Finally the packet is encrypted and transmitted. TLS also allows for a number of control messages to be transmitted.

Analyzing the above information, we see a number of operations that may form potential performance bottlenecks. Performance can be affected by the CPU costs of the RSA operations and the effectiveness of the session cache. It can also be affected by the network latency of transmitting the extra connection setup messages, as well as the CPU latency of marshaling, encrypting, decrypting, unmarshaling, and verifying packets. This paper aims to quantify these costs.
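The per-packet work described above can be sketched as follows; this is a deliberately simplified illustration (the real TLS record format also covers the record type, version, and length, and the work happens inside OpenSSL rather than in Python):

```python
import hashlib
import hmac
import struct

def rc4_keystream(key):
    """Generator for the RC4 keystream (key scheduling followed by the PRGA)."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    while True:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        yield s[(s[i] + s[j]) % 256]

def protect_record(seq_num, payload, mac_key, keystream):
    """MAC the payload together with its sequence number, then encrypt
    payload plus MAC with the shared RC4 keystream (MAC-then-encrypt)."""
    mac = hmac.new(mac_key, struct.pack(">Q", seq_num) + payload, hashlib.md5).digest()
    record = payload + mac
    return bytes(b ^ next(keystream) for b in record)

# Example: protect two consecutive records on one connection.
ks = rc4_keystream(b"example 16B key!")
rec0 = protect_record(0, b"GET /index.html HTTP/1.0", b"mac key", ks)
rec1 = protect_record(1, b"GET /logo.png HTTP/1.0", b"mac key", ks)
```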
3. Methodology

We chose not to perform "micro-benchmarks" such as measuring the necessary CPU time to perform specific operations. In a system as complex as a web server, I/O and computation are happening simultaneously and the system's bottleneck is never intuitively obvious. Instead, we chose to measure the throughput of the web server under various conditions. To measure the costs of individual operations, we replaced them with no-ops. Replacing cryptographically significant operations with no-ops is obviously insecure, but it allows us to measure an upper bound on the performance that would result from optimizing the system. In effect, we simulate ideal hardware accelerators. Based on these numbers, we can estimate the relative cost of each operation using Amdahl's Law (see Section 4).
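For a component whose no-op replacement acts like an infinitely fast accelerator, Amdahl's Law gives the component's share of the total work directly from the two throughput measurements; a minimal sketch, with made-up example numbers:

```python
def relative_cost(baseline_throughput, noop_throughput):
    """Estimate the fraction of total server work spent in a component,
    given throughput with the component in place and with it replaced
    by a no-op (an idealized, infinitely fast accelerator)."""
    speedup = noop_throughput / baseline_throughput
    # Amdahl's Law with the component's own cost driven to zero:
    #   speedup = 1 / (1 - fraction)   =>   fraction = 1 - 1/speedup
    return 1.0 - 1.0 / speedup

# If a server did 150 hits/sec normally and 300 hits/sec with RSA replaced
# by a no-op, RSA would account for roughly half of the original work.
print(relative_cost(150.0, 300.0))   # 0.5
```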
3.1. Platform

Our experiments used two different hardware platforms for the TLS web servers: a generic 500MHz Pentium III clone and a Compaq DL360 server with a single 933MHz Pentium III. Both machines had 1GB of RAM and a gigabit Ethernet interface. Some experiments also included a Compaq AXL300 [4] cryptography acceleration board. Three generic 800MHz Athlon PCs with gigabit Ethernet cards served as TLS web clients, and all experiments were performed using a private gigabit Ethernet switch.

All computers ran RedHat Linux 6.2. The standard web servers used were Apache 1.3.14 [2], and the TLS web server was Apache with mod_SSL 2.7.1-1.3.14 [14]. We have chosen the Apache mod_SSL solution due to its wide availability and use, as shown by a March 2001 survey [26]. The TLS implementation used in our experiments by mod_SSL is the open source OpenSSL 0.9.5a [19]. The HTTPS traffic load was generated using the methodology of Banga et al. [5], with additional support for OpenSSL. As we are interested primarily in studying the CPU performance bottlenecks arising from the use of cryptographic protocols, we needed to guarantee that other potential bottlenecks, such as disk or network throughput, did not cloud our throughput measurements. To address this, we used significantly more RAM in each computer than its working set, thus minimizing disk I/O when the disk caches are warm. Likewise, to avoid network contention, we used gigabit Ethernet, which provides more bandwidth than the computers in our study can reasonably generate.

3.2. Experiments performed

We performed four sets of experiments, using two different workload traces against two different machine configurations.

One workload simulated the secure servers at Amazon.com. Normally, an Amazon customer selects goods to be purchased via a normal web server, and only interacts with a secure web server when submitting credit card information and verifying purchase details. We purchased two books at Amazon, one as a new user and one as a returning user. By replicating the corresponding HTTPS requests in the proportions that they are experienced by Amazon, we can simulate the load that a genuine Amazon secure server might experience. Our other workload was a 100,000-hit trace taken from our departmental web server, using a 530MB set of files. While our departmental web server supports only normal, unencrypted web service, we measured the throughput for running this trace under TLS to determine the costs that would be incurred if our normal web server was replaced with a TLS web server.

These two workloads represent endpoints of the workload spectrum TLS-secured web servers might experience. The Amazon workload has a small average file size, 7KB, while the CS trace has a large average file size, 46KB. Likewise, the working size of the CS trace is 530MB while the Amazon trace's working size is only 279KB. Even with the data stored in RAM buffers, these two configurations provide quite different stresses upon the system. For example, the Amazon trace will likely be stored in the CPU's cache whereas the CS trace will generate more memory traffic. The Amazon trace thus places similar pressure on the memory system as we might expect from dynamically generated HTML (minus the costs of actually fetching the data from an external database server). Likewise, the CS trace may put more stress on the bulk ciphers, with its larger files, whereas the Amazon trace would put more pressure on the connection setup costs, as these connections will be, on average, much shorter lived.

In addition to replacing cryptographic operations, such as RSA, RC4, MD5/SHA-1, and secure pseudo-random number generation, with no-ops¹, we also investigated replacing the session cache with an idealized "perfect cache" that returns the same session every time (thus avoiding contention costs in the shared memory cache). Simplifying further, we created a "skeleton TLS" protocol where all TLS operations have been completely removed but messages of the same length as the TLS handshake are transmitted. This simulates an "infinitely fast" CPU that still needs to perform all the same network operations. Finally, we hypothesize a faster TLS session resumption protocol that removes two messages (one network round-trip), and measure its performance.

¹ While TLS also supports operating modes which use no encryption (e.g., TLS_NULL_WITH_NULL_NULL), our no-op replacements still use the original data structures, even if their values are now all zeros. This results in a more accurate simulation of "perfect" acceleration.

Through each of these changes, we can progressively simulate the effects of "perfect" optimizations, identifying an upper bound on the benefits available from optimizing each component of the TLS system.
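To illustrate the flavor of these replacements, the sketch below shows hypothetical stand-ins for a no-op bulk cipher and a "perfect" session cache; the interfaces are invented for illustration and are not mod_SSL's or OpenSSL's actual hooks, and, as noted in the footnote, the real no-op replacements keep the original data structures intact.

```python
class NoOpBulkCipher:
    """Stand-in bulk cipher: same call pattern and buffer handling as a
    real cipher, but no cryptographic work (hypothetical interface).
    Replacing RC4 with something like this bounds the benefit of an
    infinitely fast bulk cipher."""

    def encrypt(self, plaintext):
        return bytes(plaintext)       # copy only; no keystream applied

    def decrypt(self, ciphertext):
        return bytes(ciphertext)

class PerfectSessionCache:
    """Idealized session cache that always hits, returning one fixed
    session object and avoiding shared-memory contention entirely."""

    def __init__(self, session):
        self._session = session

    def lookup(self, session_id):
        return self._session          # every lookup is a hit
```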
3.2.1. Amazon-like workload experiments

We were interested in closely simulating the load that might be experienced by a popular e-commerce site, such as Amazon. While our experiments do not include the database back-end processing that occurs in e-commerce sites, we can still accurately model the front-end web server load.

To capture an appropriate trace, we configured a Squid proxy server and logged the data as we purchased two books from Amazon.com, one as a new customer and one as a returning customer. The web traffic to browse Amazon's inventory and select the books for purchase occurs over a regular web server, and only the final payment
and shipping portion occurs with a secure web server. Of course, the traces we recorded do not contain any plaintext from the secure web traffic, but they do indicate the number of requests made and the size of the objects transmitted by Amazon to the browser. This is sufficient information to synthesize a workload comparable to what Amazon's secure web servers might experience. The only value we could not directly measure is the ratio of new to returning Amazon customers. Luckily, Amazon provided this ratio (78% returning customers to 22% new customers) in a recent quarterly report [1]. For our experiments, we assume that returning customers do not retain TLS session state, and will thus complete the full TLS handshake every time they wish to make a purchase. In this scenario, based on our traces, the server must perform a full TLS handshake approximately once out of every twelve web requests. This one-full-handshake-per-purchase assumption may cause us to overstate the relative costs of performing full TLS handshakes, but it does represent a "worst case" that could well occur in e-commerce workloads.

We created files on disk to match the sizes collected in our trace and request those files in the order they appear in the trace. When replaying the traces, each client process uses at most four simultaneous web connections, just as common web browsers do. We also group together the hits corresponding to each complete web page (HTML files and inline images) and do not begin issuing requests for the subsequent page until the current page is completely loaded. All three client machines run 24 of these processes each, causing the server to experience a load comparable to 72 web clients making simultaneous connections.

3.2.2. CS workload experiments

We also wished to measure the performance impact of replacing our departmental web server with a TLS web server. To do this, we needed to design a system to read a trace taken from the original server and adapt it to our trace-driven TLS web client. Because we are interested in measuring maximum server throughput, we discarded the timestamps in the server and instead replayed requests from the trace as fast as possible. However, we needed to determine which requests in the original trace would have required a full TLS handshake and which requests would have reused the sessions established by those TLS handshakes. To do this, we assumed that all requests in the trace that originated at the same IP address corresponded to one web browser. The first request from a given IP address must perform a full TLS handshake. Subsequent requests from that address could reuse the previously acquired TLS session. This assumption is clearly false for large proxy servers that aggregate traffic for many users. For example, all requests from America Online users appear to originate from a small number of proxies. To avoid an incorrect estimation of the session reuse, we hand-deleted all known proxy servers from our traces. The remaining requests could then be assumed to correspond to individual users' web browsers. The final trace contained approximately 11,000 sessions spread over 100,000 requests.
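Under this one-browser-per-IP-address assumption, classifying the requests in a trace reduces to a single pass over it; a small sketch (the trace format and proxy list shown here are assumptions, not the actual tooling used in our experiments):

```python
def classify_handshakes(trace, known_proxies):
    """Count full TLS handshakes vs. resumed sessions for a trace of
    (client_ip, url) pairs, treating each distinct IP address as one
    browser and skipping requests from known proxy addresses."""
    seen_ips = set()
    full, resumed = 0, 0
    for client_ip, _url in trace:
        if client_ip in known_proxies:
            continue                  # hand-deleted from the real traces
        if client_ip in seen_ips:
            resumed += 1              # session reuse: no new RSA operation
        else:
            seen_ips.add(client_ip)
            full += 1                 # first contact: full handshake
    return full, resumed
```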
In our trace playback system, three client machines ran 20 processes each, generating 60 simultaneous connects, proving sufficient to saturate the server. The complexity of the playback system lies in its attempt to preserve the original ordering of the web requests seen in the original trace. Apache's logging mechanism actually records the order in which requests complete, not the order in which they were received. As such, we have insufficient information to faithfully replay the original trace in its original order. Instead, we derive a partial ordering from the trace. All requests from a given IP address are totally ordered, but requests from unrelated IP addresses have no ordering. This allows the system to dispatch requests in a variety of different orders, but preserves the behavior of individual traces.

As a second constraint, we wished to enforce an upper bound on how far the final requests observed by the web server may differ from the order of requests in the original trace. If this bound were too small, it would artificially limit the concurrency that the trace playback system could exploit. If the bound were too large, there would be less assurance that the request ordering observed by the experimental server accurately reflected the original behavior captured in the trace. In practice, we needed to set this boundary at approximately 10% of the length of the original trace. Tighter boundaries created situations where the server was no longer saturated, and the clients could begin no new requests until some older large request, perhaps for a very large file, could complete.

While this technique does not model the four simultaneous connections performed by modern web browsers, it does saturate the server sufficiently that we believe the server throughput numbers would not change appreciably.
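A simplified version of this dispatch policy follows directly from the two constraints (per-IP ordering plus a bounded reordering window); in the sketch below, the random choice among eligible requests is only a stand-in for however the concurrent playback clients happen to proceed.

```python
import random

def replay_order(trace, max_skew):
    """Produce one admissible dispatch order for a trace of (ip, url) pairs:
    requests from the same IP stay in their original relative order, and no
    request is issued more than max_skew positions ahead of the oldest
    not-yet-issued request."""
    assert max_skew >= 1
    by_ip = {}
    for idx, (ip, _url) in enumerate(trace):
        by_ip.setdefault(ip, []).append(idx)
    next_pos = {ip: 0 for ip in by_ip}        # head-of-line pointer per IP
    issued = [False] * len(trace)
    oldest = 0                                # oldest original index not yet issued
    order = []
    while len(order) < len(trace):
        eligible = [ip for ip, pos in next_pos.items()
                    if pos < len(by_ip[ip]) and by_ip[ip][pos] < oldest + max_skew]
        ip = random.choice(eligible)          # never empty: trace[oldest] qualifies
        idx = by_ip[ip][next_pos[ip]]
        next_pos[ip] += 1
        issued[idx] = True
        order.append(trace[idx])
        while oldest < len(trace) and issued[oldest]:
            oldest += 1
    return order
```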
4. Analysis of experimental results

Figures 1 and 2 show the main results of our experiments with the Amazon trace and the CS trace, respectively. The achieved throughput is shown on the y axis. For each system configuration labeled along the x-axis, we show two bars, corresponding to the result obtained with the 500MHz system and the 933MHz system, respectively.

Three clusters of bar graphs are shown along the x-axis. The left cluster shows three configurations of a complete, functional web server: the Apache HTTP web server (Apache), the Apache TLS web server (Apache+TLS), and the Apache TLS web server with the AXL300 accelerator (Apache+TLS AXL300).
Label               Description of server configuration
Apache              Apache server
Apache+TLS          Apache server with TLS
Apache+TLS AXL300   Apache server with TLS and AXL300
RSA                 RSA protected key exchange
PKX                 plain key exchange
NULL                no bulk cipher (plaintext)
RC4                 RC4 bulk cipher
noMAC               no MAC integrity check
MD5                 MD5 MAC integrity check
no cache            no session cache
shmcache            shared-memory based session cache
perfect cache       idealized session cache (always hits)
no randomness       no pseudo-random number generation (also: NULL, noMAC)
plain               no bulk data marshaling (plaintext written directly to the network)
fast resume         simplified TLS session resume (eliminates one round-trip)
Skeleton TLS        all messages of correct size, but zero data
Figure 2. Throughput for CS trace and different server configurations, on 500MHz and 933MHz servers. [Bar chart: one pair of bars per server configuration, comparing the PIII-500 MHz and PIII-933 MHz machines; the y-axis shows achieved throughput.]