Network Processing on an SPE Core in Cell Broadband Engine™
Yuji Kawamura,
Takeshi Yamazaki,
Tatsuya Ishiwata and Kazuyoshi Horie
Microprocessor Development Dept
Sony Computer Entertainment Inc.
Tokyo, Japan
yuji01 kawamura@hq.scei.sony.co.jp,
tyamaki@rd.scei.sony.co.jp,
{tatsuya ishiwata,kazuyoshi horie}@hq.scei.sony.co.jp
Hiroshi Kyusojin
Technology Development Group
Sony Corporation
Tokyo, Japan
kyu@sm.sony.co.jp
Abstract
Cell Broadband Engine™ is a multi-core system on a chip composed of a general-purpose Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs). Its high computational performance is achieved mainly through the SPEs' processing power.
New high-speed NICs such as 10-Gbps Ethernet require significant amounts of processing power. Even the full processing power of the PPE is insufficient to attain the maximum bandwidth on 10-Gbps Ethernet when running Linux on the Cell Broadband Engine™.
In order to avoid the bottlenecks of PPE processing, we
implemented a NIC driver and a protocol stack on an SPE.
We selected a small protocol stack that is designed for embedded systems and made size reductions to fit both the protocol stack and a NIC driver onto a single SPE, given the size limitation of the SPE's local storage (256 KB).
As a result, the protocol processing on an SPE is almost
at wire speed for UDP and about 8.5 Gbps for TCP with
lightly tuned code, and it requires no assistance from the
PPE while in the data transfer phase.
Our work shows that the use of the SPE instead of the
PPE for network processing can help resolve network per-
formance problems that can arise from handling a high-
speed NIC, including the costs of protocol processing and
memory copies.
The results indicate that our approach can lead to a suf-
ficient level of transfer rate performance.
1. Introduction
The Cell Broadband Engine™ is a multi-core system on a chip developed by Sony, IBM and Toshiba [1] [12]. The Cell Broadband Engine™ is composed of a Power Process-
ing Element which is a processor core with a PowerPC in-
struction set architecture and eight Synergistic Processing
Elements (SPEs). The SPEs are SIMD instruction set pro-
cessors with 256-KB of local storage (LS). An SPE differs
from conventional processors with cache hierarchies in that
it relies on a DMA engine to move code and data between
the LS and the main memory, which in turn enables explicit
data transfers between memory hierarchies. With a clock
speed of 3.2 GHz, the Cell Broadband Engine™ has a theoretical peak computational performance of 230.4 GFlop/s (single-precision floating point). The SPEs provide a significant portion of the computing power in a Cell Broadband Engine™ [10]. Therefore, the key to fully utilizing the performance of the Cell Broadband Engine™ is to fully exploit the SPEs.
Another important characteristic of the SPE is its pre-
dictability, which stems from its simple pipeline structure
and a flexible DMA engine. The SPE's predictability enables the development of finely tuned application codes that precisely control hardware mechanisms, allowing application programmers to achieve highly sustainable performance with real applications [9] [5]. In addition, the sim-
plified features enable efficient implementations with high
performance-to-area ratios [11] [12].
It has become increasingly difficult to gain improve-
ments in the performance of semiconductor chips through
scaling theory. Therefore, it is increasingly important to ex-
ploit a computation engine with a good performance-to-area
ratio like the SPE.
The goal of this work is to use SPEs to support system
infrastructure with the elements essential to computer sys-
tems, such as high-speed networks and storage.
The following characteristics make SPEs potentially su-
perior to existing solutions for high-speed device support.
• The allocation of dedicated resources for individual
devices is effective in achieving short and stable re-
sponse times [7]. Because SPEs occupy a smaller chip
area than general-purpose processor cores of equiva-
lent performance, when one whole processor core is
used as a dedicated processor, the allocation of one
SPE has less impact than that of a general-purpose core.
• The use of DMA enables the scheduling of a large
number of memory access requests that are asyn-
chronous to the processor core. SPEs can access the
system memory in a more flexible and tightly sched-
uled way than general cache-based systems during
computation.
• The communication between the SPE and the device is
possible with a low overhead because the devices can
access the high-speed LS directly.
• An SPE differs from a special-purpose device such as
a network processor in that it is a general-purpose re-
source that can be used for a variety of computational
tasks. Thus SPEs can be used for different purposes
depending on the status of system usage.
However, it is necessary to split execution codes and data
for the 256-KB LS, and codes and data must be swapped
as required via explicit DMA requests. This necessitates a
different programming technique from that used in conven-
tional processors, and increases the level of technical diffi-
culty in particular when porting existing large-scale appli-
cations.
This paper provides an overview of the implementation
of network protocol stack processing in an SPE and an eval-
uation of its performance.
The standard protocol stacks incorporated in operating
systems such as Linux have large memory footprints, and
require extensive modification when they are ported to an
SPE. We prioritized design time and selected, as a base, a protocol stack with a small footprint designed for embedded devices.
The protocol stack NetX [2], provided by Express Logic,
and the microkernel ThreadX, which is required for its operation, were ported to an SPE. Porting was enabled by placing a
small number of data structures (connection tables, packet
buffers, etc.) in the main memory and using DMA access.
Generally, protocol stacks use scalar code and it is diffi-
cult to use SIMD instructions to tune performance. In our
experiment, speed was increased through the adjustment of
data alignment and optimization of branching. A transfer
performance of 8.5 Gbps using TCP-IP was achieved when
a single SPE operating at 3.2 GHz was allocated exclusively
to support a 10 Gbps optical Ethernet NIC.
As indicated by Foong et al. [8], standard TCP-IP network protocol stacks require a processing power of 1 Hz/bps on a Pentium 4. At a time when 1 Gbps is standard for Ethernet NICs and the use of 10-Gbps Ethernet is becoming more widespread, network protocol stack processing consumes the largest share of CPU cycles among the services that support a single device.
In recent years, HPC cluster systems that use multi-core
processors have become common. On these systems the
load on the CPU assigned to protocol stack processing may
increase, and problems may arise from load imbalance.
To prevent this, a single core at each node is frequently
assigned to dedicated network processing [6]. In the Cell Broadband Engine™, the use of one of the eight SPEs for
this purpose makes it possible to employ 10-Gbps Ethernet.
This represents a considerable advantage over a system us-
ing conventional high-performance processor cores.
Section 2 below considers related research in order to
clarify the positioning of the project discussed here. Sec-
tion 3 discusses the characteristics of implementation of
protocol stack processing using SPEs and the methods of
optimization, while Section 4 presents the results of perfor-
mance evaluations. Section 5 presents the conclusions of
the project, and Section 6 considers problems and future di-
rections.
2. Related Works
An SPE has a unique instruction set architecture in
which all the computational instructions are SIMD instruc-
tions. This gives the SPEs a high level of floating-point
computation performance per chip area, and they are em-
ployed in various areas of numerical computation applica-
tions [5] [14]. However, system services like the kernel
services of operating systems and IO device processing are
generally scalar tasks with complicated data structures, and
there is no publication addressing the porting of these types of processing to an SPE.
We ported a network protocol stack to an SPE. One SPE was excluded from processor scheduling in the Cell Linux kernel and assigned to network protocol stack processing.
As a result, this SPE looks like a network processor [7].
As network interfaces began to handle high packet rates, interrupts reduced system performance and became a problem [13]. To avoid performance reduction from frequent interrupts, Aron et al. [4] proposed software-timer-based network processing, which can eliminate high-frequency interrupts from the NIC with timer-based polling. This approach's main objectives are to limit network processing cycles and to eliminate context switching. Our approach also eliminates interrupts from the NIC, because all interactions with the NIC are handled by polling on a dedicated SPE.

Table 1. RC-101 10-Gigabit Ethernet Adapter
Hardware            Vendor                 Sony Corporation (engineering samples available)
                    MAC                    original
Technical Features  IEEE standard          IEEE 802.3ae 10GBASE-SR
                    Wiring                 Multi-mode fiber (300 m)
                    Host bus               PCIe (x1, x4, x8)
                    Transceiver            original
                    Power consumption      7.5 W
Software Features   Checksum offload       IP, UDP, TCP
                    Jumbo frame            9 KB
                    Interrupt moderation   original
Assigning dedicated resources for network processing
has multiple benefits [15] [19]. Wu et al. [18] show that
process scheduling by a Linux kernel may largely affect
the performance of a protocol stack. Our approach elim-
inates any interference between protocol stack processing
and other kernel tasks.
Another approach controls system-wide performance by restricting the CPU cycles spent on network processing [16]. Our approach, in contrast, can completely offload network processing from the PPE on which the operating system is running.
From a security perspective, separating the protocol
stack from a monolithic kernel is preferable [17]. This sug-
gests that our scheme, which utilizes separate resources for the protocol stack, can potentially improve system security.
3. Implementing the Network Protocol Stack in
an SPE
This section discusses the system constructed in this
project: its hardware and software configurations and the functions of the implemented system. It also addresses the pro-
cesses and performance improvements required to imple-
ment a protocol stack on an SPE.
3.1. System Configuration
Figure 1 shows the configuration of the hardware sys-
tem constructed for this work. We used the Sony RC-101
10-Gigabit Ethernet Adapter. This hardware is not yet com-
mercially available but it is available for evaluation. Table 1
shows the main specifications of the RC-101 10-Gigabit
Ethernet Adapter. The Cell Broadband Engine™ is con-
nected to RC-101 using a PCIe x8 interface through a pro-
totype high-performance FlexIO-PCIe bridge.
Figure 1. Hardware System
Figure 2. Connection between RC-101 and SPE
3.2. Basic Strategies for Supporting
TCP/IP on SPE
3.2.1 Protocol stack
The standard protocol stacks incorporated in operating sys-
tems such as Linux have large memory footprints that ex-
ceed the LS size of 256-KB, and extensive modifications
are therefore necessary when they are ported to an SPE. We
prioritized design time, and used a protocol stack for em-
bedded applications, which has a small footprint, as a base.
The protocol stack NetX, manufactured by Express
Logic, and the microkernel ThreadX that is required for its
operation were ported to an SPE.
Figure 3. Software System Overview
3.2.2 Pinning
The Linux distribution for the Cell Broadband Engine™ treats
the SPEs as virtualized resources, which means that user
applications can generate more SPE threads than available
SPEs, and context switching of the SPE thread may occur
while user applications are running. However, as Figure 2
shows, we stored descriptors in the LS and attempted to op-
erate the NIC directly from the SPE, and it was therefore
necessary to pin an SPE. One SPE was pinned and used ex-
clusively for network processing.
We considered using multiple SPEs to fix the code foot-
print problem, but did not apply this method, in order to keep as many SPEs as possible free for other uses.
3.3. Software Module
Figure 3 shows the configuration of the software system.
The Linux 2.6.20 kernel with a patch for our evaluation
board is used. The kernel is running on the PPE.
One SPE was pinned for the SPE-side components which
are shown on the left. In the SPE, the ThreadX real-time
OS, RC-101 NIC driver, NetX TCP/IP protocol stack and
a service application are running. The service application
communicates with a user space application and executes
requests from the user space application by calling NetX
API. This SPE runs in the Linux kernel space.
The PPE-side component shown in the center (bottom) is
in charge of initializing, managing and terminating the SPE-
side components. Because there is no standard kernel API
call method from SPE to PPE in Linux, the SPE-side com-
ponents cannot directly call kernel API existing in the PPE
core. The PPE-side component is implemented as a ker-
nel module and calls various kernel API as requested by the
SPE side components, including the allocation/deallocation
of memory, obtaining an IO space address and probing the network adapter. The memory region shown in the center
box in Figure 3 is a kernel space memory region that is al-
located and mapped to the user space by the PPE-side com-
ponent. Communication with the user space is conducted
via ring buffers that are placed in this memory region.
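As an illustration, communication through such a shared ring buffer could follow a single-producer/single-consumer scheme like the minimal sketch below. All structure and field names are hypothetical (the actual request format is not described in this paper); the sketch shows the user-space producer side, and the SPE-side consumer would read the same structure via DMA.

    #include <stdint.h>

    /* Hypothetical request slot exchanged between a user space process
     * and the SPE-side service application through the shared region. */
    typedef struct {
        uint32_t opcode;        /* e.g. a socket/send/recv request code */
        uint32_t length;        /* number of valid bytes in payload[] */
        uint8_t  payload[248];  /* request arguments */
    } request_t;

    #define RING_SLOTS 64       /* must be a power of two */

    typedef struct {
        volatile uint32_t head; /* written only by the producer */
        volatile uint32_t tail; /* written only by the consumer */
        request_t slots[RING_SLOTS];
    } ring_t;

    /* Producer (user space) side; returns 0 on success, -1 if full. */
    int ring_push(ring_t *r, const request_t *req)
    {
        uint32_t head = r->head;
        if (head - r->tail == RING_SLOTS)
            return -1;                          /* ring is full */
        r->slots[head % RING_SLOTS] = *req;
        __asm__ __volatile__("" ::: "memory");  /* order slot write before head */
        r->head = head + 1;
        return 0;
    }

On real hardware, the compiler-only barrier would need to be replaced by an appropriate memory barrier (e.g., a PowerPC lwsync) so the slot contents become visible to the other side before the head update.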
3.4. Functions of Protocol Stack
NetX is used to support TCP and UDP. The TCP protocol stack implemented in Linux has a variety of advanced functions. However, NetX is designed for embedded systems where code size is important; thus, advanced functions such as window scaling are omitted from NetX. Table 2 shows the functions of NetX and whether they were activated in this project. Of the functions supported by NetX, only the essential functions were selected, in order to prevent the code size from becoming too large.

Table 2. Implemented functions
Function                   Supported             Enabled
ARP                        yes                   yes
RARP                       yes                   no
ICMP                       yes                   no
IGMP                       yes                   no
IPv4                       yes                   yes
IPv4 checksum              yes                   no
IPv6                       no                    -
TCP                        yes                   yes
TCP checksum               yes                   no
TCP selective ack          yes (receive only)    yes
TCP slow start algorithm   no                    -
TCP fast retransmit        no                    -
TCP window scaling         no                    -
UDP                        yes                   yes
UDP checksum               yes                   no
API                        NetX-specific socket-like API
3.5. Solving the Code Footprint Problem
We implemented the real-time OS and the protocol stack
on an SPE, as well as a simple test program in the same SPE. This exhausted the local storage, and no margin remained
for additional functions or packet memory.
Because the registers and operands in an SPE are 128-bit,
more instructions are required to access a structure without
alignment restrictions than when a standard 32-bit RISC
processor is employed. Therefore, code sizes can easily
become large in programs like protocol stacks that access
structures frequently. In order to control the code size, we
added alignment restrictions to the main structures in the
NetX. Because we employed gcc as the compiler, we used
the gcc attribute __attribute__((aligned(N))), which is provided for specifying alignment restrictions. As a result, while
the data size increased, the code size was reduced. The
size of the .text segment was reduced from 145-KB to 113-
KB after the alignment, representing a reduction of approx-
imately 21%. However, the .bss segment increased approximately 18% in size following the alignment, from 39-KB to 46-KB.

Table 3. Local store usage (bytes)
Segment         Before    After    Ratio
.text           144592   113168    0.78
packet pool      49920    57600    1.15
.bss             39328    45728    1.16
stacks           17408    17408    1.00
.rodata           4656     4720    1.01
.data             1264     1264    1.00
tcp                480     1856    3.87
thread             352     1312    3.73
udp                336     1072    3.19
.debug ranges      128      128    1.00
other             3680    17888    4.86
total           262144   262144    1.00

Table 3 shows a comparison of the segment sizes
in the LS before and after alignment. "packet pool" represents the packet pool: the area that holds the packet payloads at an MTU of 1500 bytes and the structures that control the packets. "stacks" refers to the stack area, which also holds the thread contexts, while "tcp" is the TCP control block and "udp" is the UDP control block.
In the table, “total” represents the size of the LS. Overall,
the addition of alignment restrictions to the structures re-
sulted in a size reduction of approximately 5%.
Additionally, the alignment restrictions resulted in a 27%
improvement in loopback performance for TCP and 22% for UDP.
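For illustration, an alignment restriction can be added to a structure as in the following minimal sketch (the structure and its fields are hypothetical, not NetX's actual layout):

    #include <stdint.h>

    /* Hypothetical packet-control structure. Aligning it to the SPE's
     * 16-byte quadword boundary lets gcc emit plain quadword loads and
     * stores instead of the rotate/shuffle sequences needed for
     * unaligned scalar fields, shrinking the code at the cost of some
     * padding in the data segments. */
    typedef struct packet_ctrl {
        uint32_t next;      /* index of the next packet in the pool */
        uint32_t length;    /* payload length in bytes */
        uint32_t flags;     /* ownership and status bits */
        uint32_t reserved;  /* pads the structure to a full quadword */
    } __attribute__((aligned(16))) packet_ctrl_t;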
3.6. Insertion of Branch Prediction to In-
crease Speed
Because SPEs do not have branch prediction hardware, static branch hint instructions are used to indicate the fetch direction. However, even counting branches only by the number of "if" statements in the C source code, there were more than 1200. It was impractical to insert gcc's __builtin_expect hints manually. Instead, we used gcc's profile-guided optimization (PGO) to enable branch
prediction. Communication was via loopbacks from the test
program, and statistics on the direction of branching were
gathered and fed back during compiling. Using PGO, we
were able to improve TCP loopback performance by ap-
proximately 22%.
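For reference, manual hinting with __builtin_expect would have looked like the sketch below (all names are hypothetical); PGO obtains the same branch-direction information automatically, typically by compiling with gcc's -fprofile-generate, running the workload, and recompiling with -fprofile-use.

    /* Manual static branch hints via gcc's __builtin_expect: marking the
     * error path as unlikely lets the compiler place the common path on
     * the fall-through side and aim the SPE branch hint at it. With more
     * than 1200 "if" statements this was impractical by hand. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    struct ip_header { unsigned version; };  /* minimal stand-in */

    static int drop_packet(void) { return -1; }
    static int handle_ipv4(const struct ip_header *h) { (void)h; return 0; }

    int process_packet(const struct ip_header *h)
    {
        if (unlikely(h->version != 4))  /* rare path */
            return drop_packet();
        return handle_ipv4(h);          /* common, fall-through path */
    }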
3.7. Making the System Non-preemptive to
Improve Performance
The protocol stack that we employed in this project,
NetX, is a commercial stack designed for use on top of the embedded real-time OS ThreadX, which is also manufactured by Express Logic. We ran NetX by porting ThreadX to the SPE and preventing ThreadX from being triggered by any interrupt processing other than the thread timer processing driven by SPE decrementer interrupts. The SPE core can potentially be interrupted by the NIC through the MFC Memory Mapped I/O (MMIO) registers, but the SPE processing speed is fast enough that, at low load, tasks can be rescheduled in round-robin cycles of approximately 400 nanoseconds. Given this high-speed
response, polling was employed. In addition, the exclusive
use of an SPE made it unnecessary to run in coordination
with unknown non-network processes (unlike CPU cores
running a normal OS) and therefore the maximum delay
could be controlled.
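A minimal sketch of such a polling loop is shown below, assuming the NIC writes a receive status word directly into the LS; the handler names and the tick constant are hypothetical.

    #include <spu_intrinsics.h>
    #include <stdint.h>

    #define TIMER_TICK 32000u  /* decrementer ticks per timer period (hypothetical) */

    extern volatile uint32_t rx_status;      /* receive descriptor status word,
                                                written by the NIC directly into the LS */
    extern void handle_rx(void);             /* hypothetical packet handler */
    extern void run_timer_processing(void);  /* hypothetical ThreadX timer hook */

    /* The dedicated SPE polls the NIC descriptor and the SPU decrementer,
     * so neither NIC nor timer interrupts are required. */
    void service_loop(void)
    {
        spu_writech(SPU_WrDec, 0xffffffffu);      /* start the decrementer */
        uint32_t last = spu_readch(SPU_RdDec);

        for (;;) {
            if (rx_status)                        /* new packet from the NIC? */
                handle_rx();

            uint32_t now = spu_readch(SPU_RdDec); /* counts down */
            if (last - now >= TIMER_TICK) {       /* a timer period elapsed */
                run_timer_processing();           /* drive ThreadX thread timers */
                last = now;
            }
        }
    }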
ThreadX supports preemption, but we did not enable pre-
emptive processing. As indicated above, the system is in
control of the entire task processing in the SPE. Therefore
non-preemptive task processing can be conducted without
delays, and exclusion processing such as mutexes can be
eliminated, resulting in higher speeds. Eliminating mutex
processing enabled us to increase the bandwidth of loop-
back processing by approximately 13%.
3.8. Use of the SPE to Drive NIC
Because every SPE has its own DMAC, SPEs are also capable of efficient NIC register accesses. On a conventional processor core (like the PPE), the pipeline stalls for an extremely long time, measured in instruction cycles, during an MMIO register access. In contrast, the DMAC in an SPE can be explicitly controlled by software, enabling instructions to be issued without waiting for time-consuming MMIO accesses to complete, so programs can continue to execute during MMIO accesses.
For example, when sending data, the NIC is notified of the
update in the transmission descriptor, and this notification is
normally implemented by writing the NIC MMIO register.
With SPEs, a DMA transfer instruction can be issued with-
out waiting for this write to complete, which prevents the pipeline from stalling.
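As a minimal sketch, the notification could be issued through the MFC as follows (the doorbell address and the descriptor-tail parameter are hypothetical):

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define DMA_TAG_DOORBELL 1

    /* Doorbell word staged in the LS; the MFC reads it from here. */
    static volatile uint32_t doorbell __attribute__((aligned(16)));

    /* Notify the NIC of a descriptor update without stalling the SPU:
     * the MMIO write is queued on the MFC as a small DMA put, and the
     * program keeps executing while it completes. nic_doorbell_ea is
     * the (hypothetical) effective address of the NIC's MMIO register. */
    void kick_nic(uint64_t nic_doorbell_ea, uint32_t tx_tail)
    {
        doorbell = tx_tail;
        mfc_put(&doorbell, nic_doorbell_ea, sizeof(doorbell),
                DMA_TAG_DOORBELL, 0, 0);
        /* Only when 'doorbell' is about to be reused do we wait:
         *   mfc_write_tag_mask(1 << DMA_TAG_DOORBELL);
         *   mfc_read_tag_status_all();
         */
    }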
RC-101 uses 32-byte descriptors placed in the ring buffer
for sending and receiving. We placed the descriptors in the
LS to control RC-101. When the send descriptor is updated,
an RC-101 register is written, which kicks RC-101 into reading the updated descriptor. Packet payloads are placed either in the
LS or on main memory. When placed in the LS, the data
flow is as shown by the solid line in Figure 2. The data flow
when the packet payloads were placed on main memory is
shown by the dotted line in the same figure. In the latter
case, the payload has to be retrieved from the main memory,
doubling the load on the memory system when compared to
LS. However, because it is difficult to place multiple 9-KB payloads in the LS when increasing the MTU, it is necessary to place payloads in main memory and reduce the data resident in the LS to achieve high bandwidth.
3.9. Increasing the Number of Supported
Sessions
When all functions were handled in the LS without using
the main memory, the maximum number of connected ses-
sions that could be supported simultaneously was 16. We
increased the number of sessions in order to demonstrate
that the SPE can be used for more realistic applications. It
is difficult to increase the number of sessions because of the
packet pool capacity and the sizes of the TCP control block
(TCB) and the UDP control block (UCB). The TCB and
UCB occupy approximately 1-KB and 512-B of the LS, re-
spectively. Therefore if the TCP supports 128 sessions, for
example, half the LS capacity is used. We were able to sup-
port 128 simultaneous TCP sessions by using the techniques
described below.
• Software cache for TCB and UCB
We placed the TCBs and UCBs in main memory. The protocol stack retrieves them into a small area in the LS when required. If a TCB or UCB has been modified, it is written back to main memory. These operations are handled by software in the SPE (see the sketch after this list).
The TCB and UCB include fields, such as the port number, that are frequently searched. We extracted such data into the LS and made it resident there for speed.
• Placing queues in the main memory
Because packet pool capacity is essential for guaranteeing TCP retransmission and sustaining data reception, the data placed in the TCP queues was saved to main memory, and the packet pool in the LS was used as a cache area.
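A minimal sketch of such a software cache for TCBs is given below, assuming a direct-mapped cache of LS slots and the roughly 1-KB TCB size mentioned above; all names and parameters are hypothetical.

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define TCB_SIZE 1024              /* approximate TCB size from the text */
    #define NSLOTS   8                 /* LS slots dedicated to the cache */
    #define DMA_TAG  2

    typedef struct { uint8_t data[TCB_SIZE]; } tcb_t;

    static tcb_t    slot[NSLOTS] __attribute__((aligned(128)));
    static uint64_t slot_ea[NSLOTS] = { [0 ... NSLOTS - 1] = ~0ull }; /* invalid */
    static int      slot_dirty[NSLOTS];

    /* Return an LS copy of the TCB stored at main-memory address 'ea',
     * fetching it by DMA on a miss and writing back a dirty victim first. */
    tcb_t *tcb_get(uint64_t ea)
    {
        int i = (uint32_t)(ea / TCB_SIZE) % NSLOTS;    /* direct-mapped index */
        if (slot_ea[i] == ea)
            return &slot[i];                           /* hit */

        if (slot_dirty[i])                             /* write back the victim */
            mfc_put(&slot[i], slot_ea[i], TCB_SIZE, DMA_TAG, 0, 0);

        /* Fenced get: ordered after the put in the same tag group. */
        mfc_getf(&slot[i], ea, TCB_SIZE, DMA_TAG, 0, 0);
        mfc_write_tag_mask(1 << DMA_TAG);
        mfc_read_tag_status_all();                     /* wait for completion */

        slot_ea[i] = ea;
        slot_dirty[i] = 0;
        return &slot[i];
    }

    void tcb_mark_dirty(tcb_t *t) { slot_dirty[t - slot] = 1; }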
3.10. API for User Space Application
We implemented the following application interfaces to
enable the use of the protocol stack from Linux user space
application processes.
• Proprietary socket interface API (SOCK API)
The SOCK API provides the NetX API without requir-
ing changes in the user space application programs.
The service programs that provide these API are lo-
cated in the service application program section of the
SPE-side components shown in Figure 3. These ser-
vice programs use resources reserved by the PPE
kernel module, and employ a memory area shared be-
tween the user space and the kernel space to communi-
cate with user processes. In addition, by calling NetX
API, the service programs use the NetX functions to
realize TCP or UDP communication.
Figure 4. Test Environment for End-to-End Communications
Figure 5. Test Environment for Loopback Communications
• Remote Direct Memory Access protocol API (RDMA
API)
The RDMA API follows the instructions issued by
user applications, and copies data using TCP to com-
municate between remote nodes. A proprietary protocol was used on top of TCP.
• SDP on RDMA API (SDP)
To enable the use of Infiniband SDP functions [3], we
used the library provided with SDP by emulating the HCA driver in kernel space. This supports applications that use standard sockets.
The SOCK API and the RDMA API cannot coexist at the
same time.
4. Performance Evaluation
This section discusses the measurement environment
used in performance measurements, and the measurement
results.
4.1. Measurement Environment
The measurement environment of end-to-end communi-
cations is shown in Figure 4. During measurements, RC-
101 units were connected directly with an optical cable with
no switches in between. During end-to-end communica-
tions, the sender side only transmitted data and the receiver
side received data.
The measurement environment of loopback communica-
tion is shown in Figure 5. In loopback, the packets were
reproduced in the protocol stack and returned by means of
memory copies in NetX. During loopback, send and re-
ceive transmissions were executed on a single SPE using
two threads.
Performance data is measured by running the test pro-
gram shown in Figures 4 and 5 on the SPE where the SPE-
side components are running. The NetX socket API is
called directly from the test program. 10 gigabits of data are transferred, and the throughput, which includes the IP and MAC header lengths, is measured. We measured
UDP throughput by varying UDP payload size. Also, we
measured TCP throughput by varying TCP Max Segment
Size (MSS) and TCP window size.
4.2. Loopback Communications Perfor-
mance
Loopback performance measures only the performance
of the protocol stack. However, because the packets are
copied and returned when NetX is used, results are largely
affected by the additional memory copies.
4.2.1 UDP loopback
Figure 6 shows the results of the UDP loopback test. The
bandwidth reaches 7.5 Gbps when the UDP payload size is 1472 bytes.
4.2.2 TCP loopback
Figure 7 shows the results of TCP loopback communica-
tions. The bandwidth reaches 4.6 Gbps when MSS is 1460
and window size is approximately 26 KB (MSS × 18).
4.3. End-to-End Communications Perfor-
mance
Two types of data structures are used for communica-
tion between the device driver and the protocol stack. The
packet descriptor carries control information and is placed
on LS to enable low overhead polling by SPE. The other
data structure is a packet buffer. The large data structures
Figure 6. UDP Loopback Performance: throughput [Mbps] vs. packet size [bytes]

Figure 7. TCP Loopback Performance: throughput [Mbps] vs. TCP receive window size [bytes], for MSS 128, 512 and 1460

Figure 8. UDP End-to-End Performance (Using LS Only): throughput [Mbps] vs. packet size [bytes]

Figure 9. UDP End-to-End Performance (Using the Software Cache): throughput [Mbps] vs. packet size [bytes]

Figure 10. TCP End-To-End Performance (Using LS Only): throughput [Mbps] vs. TCP receive window size [bytes], for MSS 128, 512 and 1460

Figure 11. TCP End-To-End Performance (Using the Software Cache): throughput [Mbps] vs. TCP receive window size [bytes], for MSS 128 to 8960
that are accessed by the protocol stack are UCB and TCB.
We compared two ways of arranging these data structures.
• LS configuration
This configuration puts the entire packet buffer, the UCB and the TCB in the LS. This is good for data access latency and main memory load, and thus attains a better packet handling rate. However, because of the LS size limitation, we limited the MTU size to 1500 bytes. With this configuration, eight descriptors are assigned for sending and four descriptors are assigned for receiving.
• Main memory configuration
This configuration uses the LS as a cache memory. The data structures that can be cached are the UCB, the TCB, and the packet queues. The total size of these data structures can be larger than the LS size, so the MTU size and packet pool size can be configured to be large enough. For the RC-101 NIC, 64 descriptors are assigned for sending and 32 for receiving.
4.3.1 UDP
This section discusses the results for UDP end-to-end com-
munications. Figures 8 and 9 show the results for UDP end-
to-end communications. The figures represent the LS con-
figuration and the main memory configuration, respectively.
As shown in Figure 9, wire speeds were almost attained for
UDP. The result of Figure 8 is not up to wire speed, because
of the limitation on MTU size and the number of descrip-
tors. In the main memory configuration, there is only one session, so there are no cache misses for the UCB. However, incoming packets that arrive faster than they can be processed are put into the UDP receive queue; this operation causes main memory accesses and lowers the throughput.
4.3.2 TCP
This section discusses the results for TCP end-to-end com-
munications. Figures 10 and 11 show the results for TCP
end-to-end communications. The figures represent the LS configuration and the main memory configuration, respectively. The throughput in Figure 10 is much smoother than in Figure 11, because the LS configuration involves no main memory accesses and has constant packet processing times. In Figure 10, the maximum throughput of 5.9 Gbps is reached when the MSS is 1460 bytes and the TCP window size is 23-KB (MSS × 16).
In Figure 11, a higher maximum throughput of 8.5 Gbps
is reached at a 5000 bytes MSS and a 60-KB TCP win-
dow size (MSS × 12). This is because the main memory
configuration can utilize a larger MTU size and more descriptors than the LS configuration. However, in the main memory configuration, both the sender and the receiver must access the main memory for each packet processed, to maintain the transmit and receive queues. The overhead of these main memory accesses limited the throughput of the main memory configuration.
4.3.3 Maximum Packet Rate
During transfers from the LS, packet processing performance is approximately 870K packets/sec. During transfers from main memory, it is approximately 530K packets/sec. These rates were measured in UDP end-to-end communications.
5. Conclusion
The Cell Broadband Engine™ features eight processors
with a unique design, called SPEs, on a single chip. We im-
plemented a network protocol stack on one of these SPEs.
On this single SPE, we operated the protocol stack NetX,
developed for embedded applications, and the microkernel
ThreadX required for its operation. Both were manufac-
tured by Express Logic. In addition, we implemented communication interfaces that enable the protocol-stack-driven SPE to be accessed from device drivers and other processors.
The most significant characteristic of SPE architecture is
its omission of a cache hierarchy; the SPE has an LS that
is controlled via DMA. This necessitates a different form of
programming than that used for a conventional processor.
In implementing the protocol stacks, we employed software
caches for processing a number of data structures.
We optimized the implementation of the NIC device
drivers and the protocol stack in the following ways.
• Communication between the network hardware and
the processor was realized with a low overhead by
means of direct access to the LS from the NIC.
• Data alignment was adjusted to prevent performance
from declining and the code size from increasing. (The
SPE uses 128-bit load/store instructions, rather than
scalar data load/store instructions.)
• Profile-based static branching prediction was used to
support code scheduling and branch hint instruction
scheduling by the compiler.
As a result, we achieved a TCP performance of 8.5 Gbps
for 3-KB packet sizes using a single SPE operating at
3.2 GHz. This result indicates a competitive network proto-
col processing performance, considering that we employed
a processor designed for a variety of computational applica-
tions rather than a dedicated network processor, and demon-
strates the potential of the application of SPEs in this field.
6. Future Directions
With the results achieved in this work as a starting point,
the following directions can be indicated for the future de-
velopment of network processing using SPEs.
• Further optimization for SPE use
We used 32-bit scalar code, and therefore did not make
effective use of the SPE’s 128-bit SIMD data path. We
believe that there is potential for optimization in nu-
merous areas through the significant modification of
code structures. For example, conditional branches can be replaced by select-bits instructions, and shuffle-bytes operations can be used for packet header arrangement (see the sketch after this list).
Further optimization can be expected to result in im-
proved performance.
• Seamless integration in Linux kernels
Because NetX is designed for embedded applications,
it is not compatible with standard Linux protocol
stacks, and therefore cannot replace Linux protocol
stacks. Adding the required functions and integrat-
ing the protocol stack in a Linux kernel would make
it accessible from a Linux application in the same way
as a standard protocol stack. We believe that a Linux
kernel patch shared with a full TCP-IP offload using
a network processor [7] could be used to enable this
integration.
• Implementation of various types of network packet
processing
In this work, we focused on basic protocol stack pro-
cessing, but further network packet processing can be
considered for the future. Encryption-related process-
ing such as IP-SEC and DTCP over IP, packet filter-
ing, and other applications can be named as the next
targets. SPEs have shown excellent performance in
encryption and decryption processing, as typified by
AES [5]. SIMD processing can also be expected to in-
crease the speed of pattern matching processing, and
may be applicable for accelerating packet filtering.
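As a sketch of the branch-elimination idea mentioned in the first item above, a conditional clamp on four 32-bit packet-length fields can be computed with compare and select-bits intrinsics instead of a branch (illustrative names only):

    #include <spu_intrinsics.h>

    /* Branch-free clamp: out[i] = (len[i] > limit[i]) ? limit[i] : len[i]
     * for four 32-bit fields at once. spu_cmpgt produces a per-element
     * all-ones mask, and spu_sel picks 'limit' wherever the mask is set. */
    vec_uint4 clamp_lengths(vec_uint4 len, vec_uint4 limit)
    {
        vec_uint4 over = spu_cmpgt(len, limit);
        return spu_sel(len, limit, over);
    }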
As the next step in our work, we are developing a test
application that embeds the protocol stack discussed here.
To avoid the time required for rewriting large-scale software for the SPE, and to generate usable results in a short
period, we are attempting to apply the concept of web ser-
vices that employ combinations of highly independent ser-
vices. Specifically, this combines a group of major element
technologies and the Cell Broadband Engine™ to realize
an independent service package that can be accessed via a
network.
In addition to the network interface driven by the pro-
tocol stack discussed here, we intend to port and install a
storage device, a file system and a database engine onto the
SPE. By combining a single Cell Broadband Engine™ with
commodity parts, we will implement a media data service
package that includes media streaming on a bandwidth of
several hundred MB/s, and meta-data processing.
7. Acknowledgements
The authors would like to thank Bill Lamie and David
Lamie of Express Logic for their cooperation and for dis-
closing the NetX and ThreadX source code.
The authors also thank Hisashi Tomita, Youichi Mizu-
tani, Masahiro Kajimoto, Yuichi Machida, Kazuyoshi Ya-
mada, Tsuyoshi Kano, Mitsuki Hinosugi of Sony Corpora-
tion for giving us the resources for evaluation with the RC-
101 10-Gigabit Ethernet Adapter and for helping in perfor-
mance tuning.
The authors also thank Mariko Ozawa, Ken Kurihara,
Takuhito Tsujimoto and Daniel Toshio Harrel for reviewing
this paper.
Finally, the authors would like to thank the children of Awa-
fune Football Club for their exuberant energy.
References
[1] Cell Broadband Engine™ Home Page at Sony Computer Entertainment Inc. http://cell.scei.co.jp/.
[2] Express Logic Home Page. http://www.rtos.com/.
[3] OpenFabrics Alliance Home Page. http://www.openfabrics.org/.
[4] M. Aron and P. Druschel. Soft timers: efficient microsec-
ond software timer support for network processing. ACM
Transactions on Computer Systems (TOCS), 18(3):197–228,
2000.
[5] T. Chen, R. Raghavan, J. Dale, and E. Iwata. Cell Broadband
Engine Architecture and its first implementation- A perfor-
mance view. IBM Journal of Research and Development,
51(5):559–572, 2007.
[6] P. Crowley, M. Fluczynski, J. Baer, and B. Bershad. Charac-
terizing processor architectures for programmable network
interfaces. Proceedings of the 14th international conference
on Supercomputing, pages 54–65, 2000.
[7] W. Feng, P. Balaji, C. Baron, L. Bhuyan, and D. Panda. Per-
formance Characterization of a 10-Gigabit Ethernet TOE.
Proceedings of the IEEE International Symposium on High-
Performance Interconnects (HotI), 2005.
[8] A. Foong, T. Huff, H. Hum, J. Patwardhan, and G. Regnier.
TCP performance re-visited. Performance Analysis of Sys-
tems and Software, 2003. ISPASS. 2003 IEEE International
Symposium on, pages 70–79, 2003.
[9] M. Gschwind. Chip multiprocessing and the Cell Broadband Engine. Proceedings of the 3rd Conference on Computing
frontiers, pages 1–8, 2006.
[10] M. Gschwind, P. Hofstee, B. Flachs, M. Hopkins, Y. Watan-
abe, and T. Yamazaki. A novel SIMD architecture for the Cell heterogeneous chip-multiprocessor. Hot Chips 17, August
2005.
[11] H. Hofstee. Power efficient processor architecture and the
cell processor. High-Performance Computer Architecture,
2005. HPCA-11. 11th International Symposium on, pages
258–262, 2005.
[12] J. Kahle et al. Introduction to the Cell multiprocessor.
IBM Journal of Research and Development, 49(4):589–604,
2005.
[13] I. Kim, J. Moon, and H. Yeom. Timer-Based Interrupt Mit-
igation for High Performance Packet Processing. Proceed-
ings of 5th International Conference on High-Performance
Computing in the Asia-Pacific Region, 2001.
[14] J. Kurzak and J. Dongarra. Implementation of the Mixed-
Precision High Performance LINPACK Benchmark on the
CELL Processor. Technical Report UT-CS-06-580.
[15] G. Regnier, D. Minturn, G. McAlpine, V. Saletore, and
A. Foong. ETA: Experience with an Intel Xeon Processor
as a Packet Processing Engine. 2004.
[16] K. Ryu, J. Hollingsworth, and P. Keleher. Efficient network
and I/O throttling for fine-grain cycle stealing. Proceedings
of Supercomputing '01, 2001.
[17] A. Sinha, S. Sarat, and J. Shapiro. Network subsystems
reloaded: a high-performance, defensible network subsys-
tem. Proceedings of the USENIX Annual Technical Conference, 2004.
[18] W. Wu and M. Crawford. Potential performance bottleneck
in Linux TCP. Int. J. Commun. Syst., 20(11):1263–1283, 2007.
[19] B. Wun and P. Crowley. Network I/O Acceleration in
Heterogeneous Multicore Processors. Proceedings of the
14th IEEE Symposium on High-Performance Interconnects,
pages 9–14, 2006.
Sony Transformation 60 - KutaragiSony Transformation 60 - Kutaragi
Sony Transformation 60 - Kutaragi
 
Sony Transformation 60
Sony Transformation 60 Sony Transformation 60
Sony Transformation 60
 
Moving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomMoving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living Room
 
The Technology behind PlayStation 2
The Technology behind PlayStation 2The Technology behind PlayStation 2
The Technology behind PlayStation 2
 
Cell Technology for Graphics and Visualization
Cell Technology for Graphics and VisualizationCell Technology for Graphics and Visualization
Cell Technology for Graphics and Visualization
 
Industry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignIndustry Trends in Microprocessor Design
Industry Trends in Microprocessor Design
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
 
Cellular Neural Networks: Theory
Cellular Neural Networks: TheoryCellular Neural Networks: Theory
Cellular Neural Networks: Theory
 
Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3
 
Developing Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionDeveloping Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of Destruction
 
NVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerNVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM Power
 
The Visual Computing Revolution Continues
The Visual Computing Revolution ContinuesThe Visual Computing Revolution Continues
The Visual Computing Revolution Continues
 

Kürzlich hochgeladen

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 

Kürzlich hochgeladen (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

ploit a computation engine with a good performance-to-area ratio like the SPE.
The goal of this work is to use SPEs to support system infrastructure with the elements essential to computer systems, such as high-speed networks and storage.

The following characteristics make SPEs potentially superior to existing solutions for high-speed device support.

• The allocation of dedicated resources to individual devices is effective in achieving short and stable response times [7]. Because SPEs occupy a smaller chip area than general-purpose processor cores of equivalent performance, when one whole processor core is used as a dedicated processor, the allocation of one SPE has less impact than that of a general-purpose core.

• The use of DMA enables the scheduling of a large number of memory access requests that are asynchronous to the processor core. SPEs can access the system memory during computation in a more flexible and tightly scheduled way than general cache-based systems.

• Communication between the SPE and the device is possible with low overhead because devices can access the high-speed LS directly.

• An SPE differs from a special-purpose device such as a network processor in that it is a general-purpose resource that can be used for a variety of computational tasks. Thus SPEs can be used for different purposes depending on the status of system usage.

However, it is necessary to split execution code and data to fit the 256-KB LS, and code and data must be swapped in as required via explicit DMA requests. This necessitates a different programming technique from that used on conventional processors, and increases the level of technical difficulty, in particular when porting existing large-scale applications.

This paper provides an overview of the implementation of network protocol stack processing on an SPE and an evaluation of its performance. The standard protocol stacks incorporated in operating systems such as Linux have large memory footprints, and require extensive modification when they are ported to an SPE. We prioritized design time and selected as a base a protocol stack with a small footprint designed for embedded devices. The protocol stack NetX [2], provided by Express Logic, and the microkernel ThreadX, which is required for its operation, were ported to an SPE. Porting was enabled by placing a small number of data structures (connection tables, packet buffers, etc.) in the main memory and using DMA access.

Generally, protocol stacks use scalar code, and it is difficult to use SIMD instructions to tune performance. In our experiment, speed was increased through the adjustment of data alignment and the optimization of branching. A transfer performance of 8.5 Gbps using TCP-IP was achieved when a single SPE operating at 3.2 GHz was allocated exclusively to support a 10-Gbps optical Ethernet NIC.

As indicated by Foong et al. [8], standard TCP-IP network protocol stacks require a processing power of 1 Hz/bps on a Pentium 4. At a time when 1 Gbps is standard for Ethernet NICs and the use of 10-Gbps Ethernet is becoming more widespread, network protocol stack processing consumes the largest share of CPU cycle time among the services that support a single device. In recent years, HPC cluster systems that use multi-core processors have become common. On these systems the load on the CPU assigned to protocol stack processing may increase, and problems may arise from load imbalance. To prevent this, a single core at each node is frequently assigned to dedicated network processing [6].
In the Cell Broadband EngineTM, the use of one of the eight SPEs for this purpose makes it possible to employ 10-Gbps Ethernet. This represents a considerable advantage over a system using conventional high-performance processor cores.

Section 2 below considers related research in order to clarify the positioning of the project discussed here. Section 3 discusses the characteristics of the implementation of protocol stack processing using SPEs and the methods of optimization, while Section 4 presents the results of performance evaluations. Section 5 presents the conclusions of the project, and Section 6 considers problems and future directions.

2. Related Works

An SPE has a unique instruction set architecture in which all the computational instructions are SIMD instructions. This gives the SPEs a high level of floating-point computation performance per chip area, and they are employed in various areas of numerical computation applications [5] [14]. However, system services like the kernel services of operating systems and IO device processing are generally scalar tasks with complicated data structures, and there is no publication addressing the porting of these types of processing to an SPE.

We ported a network protocol stack onto an SPE. One SPE was separated from processor scheduling on the Cell Linux kernel and assigned to processing a network protocol stack. As a result, this SPE looks like a network processor [7].

As network interfaces began to handle high packet rates, interrupts came to reduce system performance and became a problem [13]. To avoid the performance reduction caused by frequent interrupts, Aron et al. [4] proposed software-timer-based network processing, which can eliminate high-frequency interrupts from the NIC by means of timer-based polling. This approach's main objectives are to limit network processing cycles and to eliminate context switching.
Our approach also eliminates interrupts from the NIC, because all the interactions with the NIC are handled by polling on the dedicated SPE.

Assigning dedicated resources to network processing has multiple benefits [15] [19]. Wu et al. [18] show that process scheduling by a Linux kernel may largely affect the performance of a protocol stack. Our approach eliminates any interference between protocol stack processing and other kernel tasks.

There was another challenge of controlling system-wide performance by restricting CPU cycles for network processing [16]. But our approach can completely offload network processing from the PPE on which the operating system is running.

From a security perspective, separating the protocol stack from a monolithic kernel is preferable [17]. This suggests that our scheme, which utilizes separate resources for protocol stacks, can potentially improve system security.

3. Implementing the Network Protocol Stack in an SPE

This section discusses the system constructed in this project: the hardware and software configurations and the functions of the implemented system. It also addresses the processes and performance improvements required to implement a protocol stack on an SPE.

3.1. System Configuration

Figure 1 shows the configuration of the hardware system constructed for this work. We used the Sony RC-101 10-Gigabit Ethernet Adapter. This hardware is not yet commercially available, but it is available for evaluation. Table 1 shows the main specifications of the RC-101 10-Gigabit Ethernet Adapter. The Cell Broadband EngineTM is connected to the RC-101 using a PCIe x8 interface through a prototype high-performance FlexIO-PCIe bridge.

Table 1. RC-101 10-Gigabit Ethernet Adapter Features

Hardware
  Vendor: Sony Corporation (engineering samples available)
  MAC: original
Technical Features
  IEEE standard: IEEE 802.3ae 10GBASE-SR
  Wiring: multi-mode fiber (300 m)
  Host bus: PCIe (x1, 4, 8)
  Transceiver: original
  Power consumption: 7.5 W
Software Features
  Checksum offload: IP, UDP, TCP
  Jumbo frame: 9k
  Interrupt moderation: original

[Figure 1. Hardware System]

[Figure 2. Connection between RC-101 and SPE]

3.2. Basic Strategies for Supporting TCP/IP on SPE

3.2.1 Protocol stack

The standard protocol stacks incorporated in operating systems such as Linux have large memory footprints that exceed the LS size of 256-KB, and extensive modifications are therefore necessary when they are ported to an SPE. We prioritized design time, and used a protocol stack for embedded applications, which has a small footprint, as a base. The protocol stack NetX, manufactured by Express Logic, and the microkernel ThreadX that is required for its operation were ported to an SPE.
[Figure 3. Software System Overview]

3.2.2 Pinning

The Linux distribution for Cell Broadband EngineTM treats the SPEs as virtualized resources, which means that user applications can generate more SPE threads than there are available SPEs, and context switching of an SPE thread may occur while user applications are running. However, as Figure 2 shows, we stored descriptors in the LS and operated the NIC directly from the SPE, and it was therefore necessary to pin an SPE. One SPE was pinned and used exclusively for network processing.

We considered using multiple SPEs to fix the code footprint problem, but did not apply this method, in order to keep as many SPEs as possible free.

3.3. Software Module

Figure 3 shows the configuration of the software system. The Linux 2.6.20 kernel with a patch for our evaluation board is used. The kernel runs on the PPE. One SPE was pinned for the SPE-side components, which are shown on the left. On the SPE, the ThreadX real-time OS, the RC-101 NIC driver, the NetX TCP/IP protocol stack and a service application are running. The service application communicates with a user space application and executes its requests by calling the NetX API. This SPE runs in the Linux kernel space.

The PPE-side component shown in the center (bottom) is in charge of initializing, managing and terminating the SPE-side components. Because there is no standard method in Linux for calling kernel APIs from an SPE, the SPE-side components cannot directly call the kernel APIs that exist on the PPE core. The PPE-side component is implemented as a kernel module and calls various kernel APIs as requested by the SPE-side components, including the allocation/deallocation of memory, obtaining an IO space address and probing the network adapter. The memory region shown in the center box in Figure 3 is a kernel space memory region that is allocated and mapped to the user space by the PPE-side component. Communication with the user space is conducted via ring buffers placed in this memory region.

3.4. Functions of Protocol Stack

NetX is used for supporting TCP and UDP. The TCP protocol stack implemented in Linux has a variety of advanced functions. However, NetX is designed for embedded systems where code size is important, so advanced functions such as window scaling are omitted from NetX. Table 2 shows the functions of NetX and whether they were activated in this project. Of the functions supported by NetX, only the essential functions were selected, in order to prevent the code size from becoming too large.

Table 2. Implemented functions

Function                   Support              Enabled
ARP                        yes                  yes
RARP                       yes                  no
ICMP                       yes                  no
IGMP                       yes                  no
IPv4                       yes                  yes
IPv4 checksum              yes                  no
IPv6                       no                   -
TCP                        yes                  yes
TCP checksum               yes                  no
TCP selective ack          yes (receive only)   yes
TCP slow start algorithm   no                   -
TCP fast retransmit        no                   -
TCP window scaling         no                   -
UDP                        yes                  yes
UDP checksum               yes                  no
API                        NetX-specific socket-like API

3.5. Solving the Code Footprint Problem

We implemented the real-time OS and the protocol stack on an SPE, as well as a simple test program on the same SPE. This exhausted the local storage, and no margin remained for additional functions or packet memory.

Because the registers and operands in an SPE are 128-bit, more instructions are required to access a structure without alignment restrictions than when a standard 32-bit RISC processor is employed. Therefore, code sizes can easily become large in programs like protocol stacks that access structures frequently.
In order to control the code size, we added alignment restrictions to the main structures in NetX. Because we employed gcc as the compiler, we used the directive __attribute__((aligned(N))), which is provided for adding alignment restrictions. As a result, while the data size increased, the code size was reduced. The size of the .text segment was reduced from 145-KB to 113-KB after the alignment, representing a reduction of approximately 21%. However, the .bss segment increased approximately 18% in size following the alignment, from 39-KB to 46-KB.

Table 3 shows a comparison of the segment sizes in the LS before and after alignment. In the table, "packet pool" represents the packet pool, that is, the area that maintains the packet payloads at an MTU of 1500 bytes and the structures that control the packets. The term "stacks" refers to the stack area, which also holds the thread contexts, while "tcp" is the TCP control block and "udp" is the UDP control block. "total" represents the size of the LS. Overall, the addition of alignment restrictions to the structures resulted in a size reduction of approximately 5%.

Table 3. Local store usage (bytes)

Segment         Before    After     Ratio
.text           144592    113168    0.78
packet pool      49920     57600    1.15
.bss             39328     45728    1.16
stacks           17408     17408    1.00
.rodata           4656      4720    1.01
.data             1264      1264    1.00
tcp                480      1856    3.87
thread             352      1312    3.73
udp                336      1072    3.19
.debug ranges      128       128    1.00
other             3680     17888    4.86
total           262144    262144    1.00

Additionally, the alignment restrictions resulted in a 27% improvement of loopback performance for TCP and 22% for UDP.
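As a minimal sketch of this technique, the following shows the gcc directive applied to a hypothetical control structure (the structure and its fields are illustrative, not NetX's actual definitions):

```c
/* Minimal sketch of the alignment technique described above. The
 * structure and field names are hypothetical, not NetX's actual
 * definitions. Aligning (and thereby padding) to the SPE's 16-byte
 * quadword lets spu-gcc access fields without the extra shift/mask
 * code that unaligned scalar accesses would otherwise require. */
#include <stdint.h>

typedef struct session_entry {
    uint32_t local_port;    /* frequently searched field */
    uint32_t remote_port;
    uint32_t remote_ip;
    uint32_t state;
} __attribute__((aligned(16))) session_entry_t;
```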
3.6. Insertion of Branch Prediction to Increase Speed

Because SPEs do not have branch prediction hardware, static branch hint instructions are used to indicate the fetch direction. However, even when branch instructions were counted simply by the number of "if" statements in the C source code, there were more than 1200, and it was impractical to insert the gcc __builtin_expect annotations manually. Instead, we used gcc profile-guided optimization (PGO) to enable branch prediction. Communication was via loopbacks from the test program, and statistics on the direction of branching were gathered and fed back during compiling. Using PGO, we were able to improve TCP loopback performance by approximately 22%.
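A rough sketch of the two annotation routes, assuming the usual gcc mechanisms (the function and its condition are illustrative only, not the paper's code):

```c
/* Sketch of feeding branch directions to gcc so that it can schedule
 * SPE branch-hint instructions; illustrative only.
 *
 * Route (1): manual annotation with __builtin_expect -- judged
 * impractical here because the stack contains over 1200 "if"s. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int handle_segment(int checksum_ok)
{
    if (unlikely(!checksum_ok))    /* hint: the error path is rare */
        return -1;
    return 0;                      /* common case falls through */
}

/* Route (2): profile-guided optimization, the route taken in the
 * paper. A typical gcc-style build looks like:
 *
 *   spu-gcc -fprofile-generate ...   build an instrumented stack
 *   (run the loopback test to collect branch statistics)
 *   spu-gcc -fprofile-use ...        rebuild using the profile
 */
```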
3.7. Making the System Non-preemptive to Improve Performance

The protocol stack that we employed in this project, NetX, is a commercial stack, and is designed for use on the embedded real-time OS ThreadX, which is also manufactured by Express Logic. We ran NetX by porting ThreadX to the SPE and preventing ThreadX from being triggered by any type of interrupt processing other than the thread timer processing driven by SPE decrementer interrupts. Interrupts to the SPE core could potentially be raised by the NIC through the MFC Memory Mapped I/O (MMIO) registers, but the SPE processing speed is fast enough that, at low load, tasks can be rescheduled in round-robin cycles of approximately 400 nanoseconds. Given this high-speed response, polling was employed. In addition, the exclusive use of an SPE made it unnecessary to run in coordination with unknown non-network processes (unlike CPU cores running a normal OS), and the maximum delay could therefore be controlled.

ThreadX supports preemption, but we did not enable preemptive processing. As indicated above, the system is in control of the entire task processing in the SPE. Therefore non-preemptive task processing can be conducted without delays, and exclusion processing such as mutexes can be eliminated, resulting in higher speeds. Eliminating mutex processing enabled us to increase the bandwidth of loopback processing by approximately 13%.

3.8. Use of the SPE to Drive the NIC

Because SPEs have a DMAC in every core, they are also capable of efficient NIC register accesses. On a conventional processor core (like the PPE), the pipeline stalls for an extremely long period, measured in instruction cycles, during an MMIO register access. In contrast, the DMAC in an SPE can be explicitly controlled by software on the SPE, enabling instructions to be issued without waiting for time-consuming MMIO accesses to complete, so programs can continue to execute during MMIO accesses. For example, when sending data, the NIC is notified of the update to the transmission descriptor, and this notification is normally implemented by writing an NIC MMIO register. With SPEs, a DMA transfer instruction can be issued without waiting for this write to complete, which prevents stalling.

The RC-101 uses 32-byte descriptors placed in ring buffers for sending and receiving. We placed the descriptors in the LS to control the RC-101. When a send descriptor is updated, the RC-101 register is tapped and the descriptor read by the RC-101 is kicked off. Packet payloads are placed either in the LS or in main memory. When placed in the LS, the data flow is as shown by the solid line in Figure 2. The data flow when the packet payloads are placed in main memory is shown by the dotted line in the same figure. In the latter case, the payload has to be retrieved from the main memory, doubling the load on the memory system when compared to the LS. However, because it is difficult to place multiple 9-KB payloads in the LS in order to increase the MTU, it is necessary to place payloads in main memory and reduce the data resident in the LS for high bandwidth.
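The following sketch illustrates the non-stalling doorbell write using the documented MFC intrinsics from the Cell SDK; the doorbell address, tag number and surrounding descriptor handling are assumptions, since the RC-101 register map is not described here:

```c
/* Sketch of kicking the NIC without stalling, using the Cell SDK's
 * MFC intrinsics (spu_mfcio.h). The doorbell semantics, tag number
 * and descriptor handling are assumptions for illustration. */
#include <spu_mfcio.h>
#include <stdint.h>

#define KICK_TAG 1

/* Doorbell value staged in the LS; kept quadword-aligned for the MFC. */
static volatile uint32_t tx_tail __attribute__((aligned(16)));

void kick_tx(uint64_t doorbell_ea, uint32_t new_tail)
{
    tx_tail = new_tail;

    /* Queue a DMA put to the NIC's MMIO doorbell register. Unlike a
     * direct MMIO store on a conventional core, this returns at once;
     * the MFC performs the slow write while the SPE keeps executing. */
    mfc_put(&tx_tail, doorbell_ea, sizeof(tx_tail), KICK_TAG, 0, 0);

    /* ... continue protocol processing here ... */

    /* Wait on the tag only when it must be reused. */
    mfc_write_tag_mask(1 << KICK_TAG);
    mfc_read_tag_status_all();
}
```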
3.9. Increasing the Number of Supported Sessions

When all functions were handled in the LS without using the main memory, the maximum number of simultaneously connected sessions that could be supported was 16. We increased the number of sessions in order to demonstrate that the SPE can be used for more realistic applications. It is difficult to increase the number of sessions because of the packet pool capacity and the sizes of the TCP control block (TCB) and the UDP control block (UCB). The TCB and UCB occupy approximately 1-KB and 512-B of the LS, respectively. Therefore, if TCP supports 128 sessions, for example, half the LS capacity is used. We were able to support 128 simultaneous TCP sessions by using the techniques described below.

• Software cache for TCB and UCB
We placed the TCB and UCB in main memory. The protocol stack retrieves them into a small area in the LS when required; if a TCB or UCB was modified, it is restored to the main memory. These operations are handled by software on the SPE. The TCB and UCB include port number fields, which are frequently searched. We extracted such data into the LS and made it resident to speed up searching. (A sketch of this scheme follows this list.)

• Placing queues in the main memory
The packet pool capacity is essential to guaranteeing TCP retransmission and to sustaining data reception. The data placed in the TCP queues was therefore saved to the main memory, and the packet pool in the LS was used as a cache area.
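A rough sketch of such a software cache, under the assumption of a trivial victim policy and with names invented for illustration:

```c
/* Rough sketch of the TCB/UCB software cache: control blocks live in
 * main memory and are DMA'd into LS slots on demand. Names, sizes and
 * the (trivial) victim policy are illustrative, not the actual code. */
#include <spu_mfcio.h>
#include <stdint.h>

#define TCB_SIZE    1024     /* a TCB occupies about 1-KB, as above */
#define CACHE_SLOTS 4
#define CACHE_TAG   2

static uint8_t  slots[CACHE_SLOTS][TCB_SIZE] __attribute__((aligned(128)));
static uint64_t slot_ea[CACHE_SLOTS];   /* main-memory address cached */
static int      slot_dirty[CACHE_SLOTS];

void *tcb_fetch(uint64_t ea)
{
    int i, victim = 0;                  /* trivial victim choice */

    for (i = 0; i < CACHE_SLOTS; i++)
        if (slot_ea[i] == ea)
            return slots[i];            /* hit: block already resident */

    if (slot_dirty[victim])             /* write back a modified block */
        mfc_put(slots[victim], slot_ea[victim], TCB_SIZE, CACHE_TAG, 0, 0);

    mfc_get(slots[victim], ea, TCB_SIZE, CACHE_TAG, 0, 0);
    mfc_write_tag_mask(1 << CACHE_TAG);
    mfc_read_tag_status_all();          /* block until the DMA finishes */

    slot_ea[victim]    = ea;
    slot_dirty[victim] = 0;
    return slots[victim];
}
```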
3.10. API for User Space Application

We implemented the following application interfaces to enable the use of the protocol stack from Linux user space application processes.

• Proprietary socket interface API (SOCK API)
The SOCK API provides the NetX API without requiring changes in user space application programs. The service programs that provide this API are located in the service application section of the SPE-side components shown in Figure 3. These service programs use resources guaranteed by the PPE kernel module, and employ a memory area shared between the user space and the kernel space to communicate with user processes. In addition, by calling the NetX API, the service programs use the NetX functions to realize TCP or UDP communication.

• Remote Direct Memory Access protocol API (RDMA API)
The RDMA API follows the instructions issued by user applications, and copies data using TCP to communicate between remote nodes. Unique protocols were used on top of TCP.

• SDP on RDMA API (SDP)
To enable the use of InfiniBand SDP functions [3], we used the library provided with the SDP through emulation of the HCA driver in kernel space. This is to support applications using standard sockets.

The SOCK API and the RDMA API cannot coexist at the same time.

4. Performance Evaluation

This section discusses the measurement environment used in the performance measurements, and the measurement results.

4.1. Measurement Environment

The measurement environment for end-to-end communications is shown in Figure 4. During measurements, the RC-101 units were connected directly with an optical cable, with no switches in between. During end-to-end communications, the sender side only transmitted data and the receiver side received data.

[Figure 4. Test Environment for End to End Communications]

The measurement environment for loopback communication is shown in Figure 5. In loopback, the packets were reproduced in the protocol stack and returned by means of memory copies in NetX. During loopback, send and receive transmissions were executed on a single SPE using two threads.

[Figure 5. Test Environment for Loopback Communications]

Performance data is measured by running the test program shown in Figures 4 and 5 on the SPE where the SPE-side components are running. The NetX socket API is called directly from the test program. 10 Gigabits of data are transferred, and the throughput, which includes the IP header and MAC header lengths, is measured. We measured UDP throughput by varying the UDP payload size, and TCP throughput by varying the TCP Maximum Segment Size (MSS) and the TCP window size.
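As an illustration of what such a test loop looks like against NetX's documented UDP socket calls (the paper's actual test code is not shown; the socket parameters and setup here are assumptions):

```c
/* Hedged sketch of a UDP send loop in the style of the test program,
 * using NetX's documented UDP socket calls. The IP instance and packet
 * pool are assumed to be created at initialization; parameter values
 * are assumptions, and error handling is omitted. */
#include "nx_api.h"

extern NX_IP          ip_instance;   /* created during initialization */
extern NX_PACKET_POOL pool;

void udp_send_loop(ULONG dest_ip, UINT dest_port, ULONG payload_size)
{
    NX_UDP_SOCKET socket;
    NX_PACKET    *packet;
    static UCHAR  payload[1472];     /* maximum payload at MTU 1500 */

    nx_udp_socket_create(&ip_instance, &socket, "test socket",
                         NX_IP_NORMAL, NX_DONT_FRAGMENT, 0x80, 5);
    nx_udp_socket_bind(&socket, 0x88, NX_WAIT_FOREVER);

    for (;;) {
        nx_packet_allocate(&pool, &packet, NX_UDP_PACKET, NX_WAIT_FOREVER);
        nx_packet_data_append(packet, payload, payload_size,
                              &pool, NX_WAIT_FOREVER);
        nx_udp_socket_send(&socket, packet, dest_ip, dest_port);
    }
}
```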
4.2. Loopback Communications Performance

Loopback performance measures only the performance of the protocol stack. However, because the packets are copied and returned when NetX is used, the results are largely affected by the additional memory copies.

4.2.1 UDP loopback

Figure 6 shows the results of the UDP loopback test. The bandwidth reaches 7.5 Gbps when the UDP payload size is 1472 bytes (the maximum payload at an MTU of 1500 bytes).

[Figure 6. UDP Loopback Performance: throughput (Mbps) versus packet size (bytes)]

4.2.2 TCP loopback

Figure 7 shows the results of TCP loopback communications. The bandwidth reaches 4.6 Gbps when the MSS is 1460 bytes and the window size is approximately 26-KB (MSS × 18).

[Figure 7. TCP Loopback Performance: throughput (Mbps) versus TCP receive window size (bytes), for MSS 128, 512 and 1460]

4.3. End-to-End Communications Performance

Two types of data structures are used for communication between the device driver and the protocol stack. The packet descriptor carries control information and is placed in the LS to enable low-overhead polling by the SPE. The other data structure is the packet buffer. The large data structures that are accessed by the protocol stack are the UCB and the TCB. We compared two ways of arranging these data structures.

• LS configuration
This configuration puts the entire packet buffer, the UCB and the TCB in the LS. This is good for data access latency and main memory load, and thus attains a better packet handling rate. However, because of the LS size limitation, we limited the MTU size to 1500 bytes. With this configuration, eight descriptors are assigned for sending and four descriptors are assigned for receiving.

• Main memory configuration
This configuration uses the LS as a cache memory. The data structures that can be cached are the UCB, the TCB, and the packet queues. The total size of these data structures can be larger than the LS size, so the MTU size and packet pool size can be configured to be large enough. 64 descriptors for the RC-101 NIC are assigned for sending and 32 descriptors for receiving.

4.3.1 UDP

This section discusses the results for UDP end-to-end communications. Figures 8 and 9 show these results, representing the LS configuration and the main memory configuration, respectively.

[Figure 8. UDP End-to-End Performance (Using LS Only): throughput (Mbps) versus packet size (bytes)]

[Figure 9. UDP End-to-End Performance (Using the Software Cache): throughput (Mbps) versus packet size (bytes)]

As shown in Figure 9, wire speed was almost attained for UDP. The result in Figure 8 does not reach wire speed because of the limitation on the MTU size and the number of descriptors. In the main memory configuration there is only one session, so there are no cache misses for the UCB. But incoming packets arriving at a higher rate than the processing power can handle are put into the UDP receive queue; this operation causes main memory accesses and lowers the throughput.

4.3.2 TCP

This section discusses the results for TCP end-to-end communications. Figures 10 and 11 show these results, representing the LS configuration and the main memory configuration, respectively.

[Figure 10. TCP End-to-End Performance (Using LS Only): throughput (Mbps) versus TCP receive window size (bytes), for MSS 128, 512 and 1460]

[Figure 11. TCP End-to-End Performance (Using the Software Cache): throughput (Mbps) versus TCP receive window size (bytes), for MSS 128 to 8960]

The throughput in Figure 10 is much smoother than in Figure 11, because Figure 10 involves no main memory accesses and has static packet processing times. In Figure 10, the maximum throughput of 5.9 Gbps is reached when the MSS is 1460 bytes and the TCP window size is 23-KB (MSS × 16). In Figure 11, a higher maximum throughput of 8.5 Gbps is reached at a 5000-byte MSS and a 60-KB TCP window size (MSS × 12). This is because the main memory configuration can utilize a larger MTU size and a larger number of descriptors than the LS configuration.
But in the main memory configuration, both the sender and the receiver must access the main memory for each packet processed in order to maintain the transmit queue and the receive queue. The overhead of these main memory accesses limited the throughput of the main memory configuration.

4.3.3 Maximum Packet Rate

During transfers from the LS, packet processing performance is approximately 870K packets/sec. During transfers from the main memory, packet processing performance is approximately 530K packets/sec. These were measured in UDP end-to-end communications.

5. Conclusion

The Cell Broadband EngineTM features eight processors with a unique design, called SPEs, on a single chip. We implemented a network protocol stack on one of these SPEs. On this single SPE, we operated the protocol stack NetX, developed for embedded applications, and the microkernel ThreadX required for its operation, both manufactured by Express Logic. In addition, there are communication interfaces enabling the protocol-stack-driven SPE to be accessed from device drivers and other processors.

The most significant characteristic of the SPE architecture is its omission of a cache hierarchy; the SPE has an LS that is controlled via DMA. This necessitates a different form of programming than that used for a conventional processor. In implementing the protocol stacks, we employed software caches for processing a number of data structures.

We optimized the implementation of the NIC device driver and the protocol stack in the following ways.

• Communication between the network hardware and the processor was realized with a low overhead by means of direct access to the LS from the NIC.

• Data alignment was adjusted to prevent performance from declining and the code size from increasing. (The SPE uses 128-bit load/store instructions, rather than scalar data load/store instructions.)

• Profile-based static branch prediction was used to support code scheduling and branch hint instruction scheduling by the compiler.

As a result, we achieved a TCP performance of 8.5 Gbps for 3-KB packet sizes using a single SPE operating at 3.2 GHz. This result indicates a competitive network protocol processing performance, considering that we employed a processor designed for a variety of computational applications rather than a dedicated network processor, and demonstrates the potential of applying SPEs in this field.

6. Future Directions

With the results achieved in this work as a starting point, the following directions can be indicated for the future development of network processing using SPEs.

• Further optimization for SPE use
We used 32-bit scalar code, and therefore did not make effective use of the SPE's 128-bit SIMD data path. We believe that there is potential for optimization in numerous areas through the significant modification of code structures. For example, conditional branches can be replaced by select-bits instructions, and shuffle-byte operations can be used for packet header arrangement (a brief sketch follows this list). Further optimization can be expected to result in improved performance.
• Seamless integration in Linux kernels
Because NetX is designed for embedded applications, it is not compatible with the standard Linux protocol stacks, and therefore cannot replace them. Adding the required functions and integrating the protocol stack into a Linux kernel would make it accessible from a Linux application in the same way as a standard protocol stack. We believe that a Linux kernel patch shared with a full TCP-IP offload using a network processor [7] could be used to enable this integration.

• Implementation of various types of network packet processing
In this work, we focused on basic protocol stack processing, but further network packet processing can be considered for the future. Encryption-related processing such as IPsec and DTCP over IP, packet filtering, and other applications can be named as the next targets. SPEs have shown excellent performance in encryption and decryption processing, as typified by AES [5]. SIMD processing can also be expected to increase the speed of pattern matching, and may be applicable to accelerating packet filtering.
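As a brief sketch of the select-bits idea mentioned in the first item of the list above (the clamp operation is illustrative only; spu_intrinsics.h is part of the Cell SDK):

```c
/* Brief sketch of replacing a conditional branch with a select-bits
 * operation. The clamp operation is illustrative only. Compile with
 * spu-gcc; spu_intrinsics.h is part of the Cell SDK. */
#include <spu_intrinsics.h>

/* Scalar version: a data-dependent branch the SPE may mispredict. */
unsigned int clamp_scalar(unsigned int x, unsigned int limit)
{
    if (x > limit)
        return limit;
    return x;
}

/* Branch-free SIMD version: four lanes per call, no branch at all. */
vector unsigned int clamp_simd(vector unsigned int x,
                               vector unsigned int limit)
{
    vector unsigned int gt = spu_cmpgt(x, limit);  /* all-ones where x > limit */
    return spu_sel(x, limit, gt);                  /* take limit where gt set  */
}
```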
As the next step in our work, we are developing a test application that embeds the protocol stack discussed here. To avoid the time required to rewrite large-scale software for the SPE, and to generate usable results in a short period, we are attempting to apply the concept of web services that employ combinations of highly independent services. Specifically, this combines a group of major element technologies and the Cell Broadband EngineTM to realize an independent service package that can be accessed via a network.

In addition to the network interface driven by the protocol stack discussed here, we intend to port and install a storage device, a file system and a database engine onto the SPE. By combining a single Cell Broadband EngineTM with commodity parts, we will implement a media data service package that includes media streaming on a bandwidth of several hundred MB/s, and meta-data processing.

7. Acknowledgements

The authors would like to thank Bill Lamie and David Lamie of Express Logic for their cooperation and for disclosing the NetX and ThreadX source code. The authors also thank Hisashi Tomita, Youichi Mizutani, Masahiro Kajimoto, Yuichi Machida, Kazuyoshi Yamada, Tsuyoshi Kano and Mitsuki Hinosugi of Sony Corporation for providing the resources for evaluation with the RC-101 10-Gigabit Ethernet Adapter and for helping with performance tuning. The authors also thank Mariko Ozawa, Ken Kurihara, Takuhito Tsujimoto and Daniel Toshio Harrel for reviewing this paper. Finally, the authors would like to thank the children of Awafune Football Club for their exuberant energy.

References

[1] Cell Broadband EngineTM Home Page at Sony Computer Entertainment Inc. http://cell.scei.co.jp/.
[2] Express Logic Home Page. http://www.rtos.com/.
[3] OpenFabrics Alliance Home Page. http://www.openfabrics.org/.
[4] M. Aron and P. Druschel. Soft timers: efficient microsecond software timer support for network processing. ACM Transactions on Computer Systems (TOCS), 18(3):197–228, 2000.
[5] T. Chen, R. Raghavan, J. Dale, and E. Iwata. Cell Broadband Engine Architecture and its first implementation - a performance view. IBM Journal of Research and Development, 51(5):559–572, 2007.
[6] P. Crowley, M. Fluczynski, J. Baer, and B. Bershad. Characterizing processor architectures for programmable network interfaces. Proceedings of the 14th International Conference on Supercomputing, pages 54–65, 2000.
[7] W. Feng, P. Balaji, C. Baron, L. Bhuyan, and D. Panda. Performance characterization of a 10-Gigabit Ethernet TOE. Proceedings of the IEEE International Symposium on High-Performance Interconnects (HotI), 2005.
[8] A. Foong, T. Huff, H. Hum, J. Patwardhan, and G. Regnier. TCP performance re-visited. 2003 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 70–79, 2003.
[9] M. Gschwind. Chip multiprocessing and the Cell Broadband Engine. Proceedings of the 3rd Conference on Computing Frontiers, pages 1–8, 2006.
[10] M. Gschwind, P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki. A novel SIMD architecture for the Cell heterogeneous chip-multiprocessor. Hot Chips 17, August 2005.
[11] H. Hofstee. Power efficient processor architecture and the Cell processor. 11th International Symposium on High-Performance Computer Architecture (HPCA-11), pages 258–262, 2005.
[12] J. Kahle et al. Introduction to the Cell multiprocessor. IBM Journal of Research and Development, 49(4):589–604, 2005.
[13] I. Kim, J. Moon, and H. Yeom. Timer-based interrupt mitigation for high performance packet processing. Proceedings of the 5th International Conference on High-Performance Computing in the Asia-Pacific Region, 2001.
[14] J. Kurzak and J. Dongarra. Implementation of the mixed-precision High Performance LINPACK benchmark on the CELL processor. Technical Report UT-CS-06-580.
[15] G. Regnier, D. Minturn, G. McAlpine, V. Saletore, and A. Foong. ETA: experience with an Intel Xeon processor as a packet processing engine. 2004.
[16] K. Ryu, J. Hollingsworth, and P. Keleher. Efficient network and I/O throttling for fine-grain cycle stealing. Proceedings of Supercomputing '01, 2001.
[17] A. Sinha, S. Sarat, and J. Shapiro. Network subsystems reloaded: a high-performance, defensible network subsystem. Proceedings of the USENIX Annual Technical Conference 2004, pages 19–19, 2004.
[18] W. Wu and M. Crawford. Potential performance bottleneck in Linux TCP. Int. J. Commun. Syst., 20(11):1263–1283, 2007.
[19] B. Wun and P. Crowley. Network I/O acceleration in heterogeneous multicore processors. Proceedings of the 14th IEEE Symposium on High-Performance Interconnects, pages 9–14, 2006.