Steen_Dissertation_March5

Steen Larsen
March 5, 2015
Offloading of I/O Transactions in Current CPU Architectures

Agenda Outline
 Introduction and motivation
 Background
 iDMA
 Hot Potato
 Device2Device
 Conclusions

Growing I/O System Performance Discrepancy

Why such complexity?
 Programmed I/O (PIO)
◦ Direct CPU interaction with I/O devices
◦ Decades-old method
◦ Extremely slow relative to CPU frequency
 CPU DMA engine
◦ “DMA channels”
◦ Limited by I/O device performance
◦ I/O devices need to be compatible with the CPU DMA engine
 These methods are seen in embedded devices, but not in mainstream general-purpose CPUs.
 System security aspect
◦ PCI and PCIe allow I/O devices to access memory

Background
 Direct integration
 Supercomputing I/O forwarding
 RDMA (Future I/O and NGIO)

Direct NIC Integration
 Sun Niagara CPU [8 core 64 threads]
 Dual 10GbE on-die
 Released in 2007

Supercomputer I/O Processors with I/O Transaction Forwarding
 This approach leads to NGIO + Future I/O => InfiniBand & iWARP RDMA
 Connection context offloading (similar to TOE architectures)
 “30 minutes to print `Hello world`”
 http://www.mcs.anl.gov/papers/P1594A.pdf

iDMA Transaction Operations
[Figure panels: iDMA transmit; iDMA receive]

iDMA Latency and Throughput Benefits
Total critical path latency = TxSW + TxHW + fiber + RxHW + RxSW
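
The latency sum above can be written as a small helper, which makes it easy to see how much each stage contributes. This is only a sketch: the function names and the example numbers in the usage note are hypothetical and are not measurements from this work.

```python
def critical_path_latency(tx_sw, tx_hw, fiber, rx_hw, rx_sw):
    """Total critical path latency = TxSW + TxHW + fiber + RxHW + RxSW.

    All arguments must use the same time unit (for example, microseconds).
    """
    return tx_sw + tx_hw + fiber + rx_hw + rx_sw


def latency_shares(tx_sw, tx_hw, fiber, rx_hw, rx_sw):
    """Return each component's fraction of the total critical path latency."""
    parts = {
        "TxSW": tx_sw,
        "TxHW": tx_hw,
        "fiber": fiber,
        "RxHW": rx_hw,
        "RxSW": rx_sw,
    }
    total = critical_path_latency(tx_sw, tx_hw, fiber, rx_hw, rx_sw)
    return {name: value / total for name, value in parts.items()}
```

Calling `latency_shares(2.0, 1.0, 0.5, 1.0, 2.5)` with placeholder values shows which stages dominate; reducing a hardware term shrinks the total by the same amount because the stages are serial.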

Hot Potato
 After surveying I/O transactions and acceleration functions, we chose to keep the focus on descriptors.
 Treat the payload data as a Hot Potato

Hot Potato Motivation: Legacy NIC Internal Design
[Figure panels: Transmit I/O; Receive I/O]

Write-Combining Buffers
[Figure: CPU core with 64B WC buffers (all marked full) and a PCIe packet with a 24B header and 64B payload; Myricom experiments]
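
The figure's packet sizes give a quick estimate of link efficiency for writes that leave a write-combining buffer. The sketch below assumes the 24B header is the only per-packet overhead counted; the function name is hypothetical.

```python
def pcie_payload_efficiency(payload_bytes=64, header_bytes=24):
    """Fraction of bytes on the link that carry payload.

    Defaults are the sizes on the slide: 64B payload, 24B header.
    Assumes no other per-packet overhead is counted.
    """
    return payload_bytes / (payload_bytes + header_bytes)


# A full 64B write-combining buffer: 64 / (64 + 24)
print(round(pcie_payload_efficiency(), 3))  # 0.727

# A partial write of 8 bytes pays the same header: 8 / (8 + 24)
print(pcie_payload_efficiency(payload_bytes=8))  # 0.25
```

The comparison shows why full buffers matter: the header cost is fixed, so smaller writes spend more of the link on overhead.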

PCIe Packet Framing
To discuss further we need to dig into PCIe protocol details:

Typical ICMP Ping Sequence
[Figure labels: doorbell write; descriptor read; ICMP packet; IP address “C0A80001”]
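
The hexadecimal IP address in the trace can be decoded to dotted-decimal form. A minimal sketch, assuming the four bytes appear in network byte order; the function name is hypothetical.

```python
import ipaddress


def hex_to_ipv4(hex_str):
    """Convert 8 hex digits (4 bytes, network byte order) to dotted decimal."""
    raw = bytes.fromhex(hex_str)
    if len(raw) != 4:
        raise ValueError("an IPv4 address is exactly 4 bytes")
    return str(ipaddress.IPv4Address(raw))


print(hex_to_ipv4("C0A80001"))  # 192.168.0.1
```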

Hot-Potato Latency and Throughput Benefit

Measurements and Conclusions
 1.5 µs latency reduction in benchmark tests
 8% latency reduction in a real memcached application

Device2Device (D2D)
 Shift gears from CPU I/O transactions to inter-device communication.

Legacy Video Streaming from Storage
[Figure: CPU; system memory (DDR3) with a kernel buffer for disk data and a kernel buffer for network packet data; sender device (e.g., SSD) with disk write and disk read queues; receiver device (e.g., NIC) with TX and RX queues; steps numbered 1, 2, 3]

Details of a D2D NIC
[Block diagram: D2D-enabled NIC]
 PCIe PHY; PCIe TLP/DLP/LLP processing; PCIe BARx Memory Space
 Legacy NIC Control Registers; D2D UOE Control Registers; Optional D2D Flow Control Registers
 Legacy Tx DMA; Legacy Rx DMA
 D2D Rx FSM; D2D Rx Queue + Frame Header
 D2D Tx FSM; D2D Tx Queue + Frame Header
 UOE / Parse Control; UOE / Packet-Based Priority Control
 Transmit Packet Queue(s); TX PHY
 Receive Packet Queue(s); RX PHY

Details of a D2D SSD
[Block diagram: D2D-enabled SSD]
 PCIe PHY; PCIe BARx Memory Space
 Legacy NIC Control Registers; D2D Flow Control Registers
 Legacy Tx DMA; Legacy Rx DMA
 Modified Tx DMA; Modified Rx DMA
 D2D Tx FSM; D2D Rx FSM
 D2D Tx Queue; D2D Rx Queue
 SSD Controller and Buffers; SSD
 Optional

D2D Transmit FSM
 Init state: Set Tx Flow Control registers
◦ Tx Address register
◦ D2D Transmit Byte Count register
◦ Data Rate and Granularity parameters
◦ Tx and Rx Base Credits
 Parse state:
◦ Map OS block addresses to SSD physical addresses
◦ Enqueue in D2D Tx Queue
 Send state:
◦ Fetch and forward SSD block data to the PCIe interface.
 Wait state:
◦ Wait until the next chunk needs to be sent.
◦ Depends on Data Rate and Granularity
 Check state:
◦ Check whether bytes sent < D2D Transmit Byte Count
[State diagram: Idle, Init, Parse, Send, Wait, Check; the Check state has True and False exits]
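
The transmit states above can be sketched as a small software model. This is an illustration only, not the FPGA implementation. Several details are assumptions: the transition order (Idle, Init, Parse, Send, Wait, Check), the choice that a true Check returns to Send while a false Check returns to Idle, and the names `TxState` and `run_tx_fsm`. The Wait state is a placeholder and does not model real pacing.

```python
from enum import Enum, auto


class TxState(Enum):
    IDLE = auto()
    INIT = auto()
    PARSE = auto()
    SEND = auto()
    WAIT = auto()
    CHECK = auto()


def run_tx_fsm(byte_count, granularity):
    """Walk the assumed Tx FSM and return (visited states, bytes sent).

    byte_count  -- stands in for the D2D Transmit Byte Count register
    granularity -- bytes forwarded per Send state (assumed chunk size)
    """
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    trace = [TxState.IDLE]
    state = TxState.INIT
    sent = 0
    while state is not TxState.IDLE:
        trace.append(state)
        if state is TxState.INIT:        # flow control registers are set
            state = TxState.PARSE
        elif state is TxState.PARSE:     # block addresses mapped and enqueued
            state = TxState.SEND
        elif state is TxState.SEND:      # forward one chunk
            sent += min(granularity, byte_count - sent)
            state = TxState.WAIT
        elif state is TxState.WAIT:      # pacing delay would happen here
            state = TxState.CHECK
        elif state is TxState.CHECK:     # bytes sent < Transmit Byte Count?
            state = TxState.SEND if sent < byte_count else TxState.IDLE
    trace.append(TxState.IDLE)
    return trace, sent
```

For example, `run_tx_fsm(3000, 1500)` passes through Send, Wait, and Check twice before returning to Idle.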

D2D Receive FSM (with UOE)
 Init state: Set D2D Flow & UOE Control registers
◦ MAC source (SSD) and destination (NIC) addresses.
◦ Source and destination IP addresses.
◦ UDP source and destination port, length, and checksum.
◦ RTP version, sequence #, and timestamp.
 Fetch state:
◦ Monitor D2D Rx Queue for data.
 Frame state:
◦ Assign static fields: MAC & IP addresses.
 Calc state:
◦ Assign Ethernet length & CRC, IP length & checksum, UDP length & checksum, RTP timestamp & sequence #.
◦ Enqueue in Tx Packet Queue.
 Send state:
◦ Send to MAC layer for transmission.
[State diagram: Idle, Init, Fetch, Frame, Calc, Send]
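
The Frame and Calc states above amount to filling in static header fields once and per-packet fields (lengths, checksums) for each payload. The sketch below shows that split for the IPv4 and UDP headers only, in software rather than hardware. It is an assumption-laden illustration: the function names, the TTL of 64, and leaving the UDP checksum at zero (permitted for UDP over IPv4) are choices made here, and the Ethernet CRC and RTP fields are omitted.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum used by the IPv4 header."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                      # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_ipv4_udp_headers(src_ip: bytes, dst_ip: bytes,
                           src_port: int, dst_port: int,
                           payload: bytes) -> bytes:
    """Return a 20-byte IPv4 header followed by an 8-byte UDP header."""
    udp_len = 8 + len(payload)              # Calc: UDP length
    total_len = 20 + udp_len                # Calc: IP total length
    # Frame: static fields (version/IHL 0x45, TTL 64, protocol 17 = UDP,
    # addresses). The checksum field is zero while it is being computed.
    ip = struct.pack("!BBHHHBBH4s4s",
                     0x45, 0, total_len, 0, 0, 64, 17, 0, src_ip, dst_ip)
    csum = internet_checksum(ip)            # Calc: IP header checksum
    ip = ip[:10] + struct.pack("!H", csum) + ip[12:]
    # UDP checksum left as 0 (not computed) to keep the sketch short.
    udp = struct.pack("!HHHH", src_port, dst_port, udp_len, 0)
    return ip + udp
```

A receiver can validate the result by recomputing `internet_checksum` over the first 20 bytes, which yields 0 for a correct header.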

NetFPGA Logical Architecture
Xilinx Virtex-5 TX240T FPGA, 10GbE, and memories
[Block diagram]
 Software Interface Layer: nf0, nf1, nf2, nf3, ioctl
 PCIe Interface Layer: shared PCIe interface; DMA Engine Registers; Legacy DMA RxQ; Legacy DMA TxQ
 AXI Lite; AMBA AXI-Stream Interface (160 MHz, 64-bit)
 Control & Status Interface; Data Path Interface
 NetFPGA Internal Control and Routing: D2D Tx/Rx Queues; D2D Tx/Rx FSMs; read/write from/to D2D control registers
 Ethernet: four MAC TxQ / MAC RxQ pairs

D2D VoD streaming configuration
[Figure labels]
 System Output Display (SOD)
 Out-bound device: UDP VoD stream
 Stimulus System (SS)
 In-bound device: Emulated SSD stream
 System Under Test (SUT): CPU, System Memory, PCIe Bridge, PCIe Interfaces
 Incoming network data is treated as storage data to be written to the D2D-enabled NIC
 Two D2D-enabled NIC block diagrams, each with: Transmit and Receive Packet Queue(s); TX PHY and RX PHY; Legacy RX DMA and Legacy TX DMA; PCIe BARx Memory Space; Legacy NIC Control Registers; D2D Flow Control Registers; D2D UOE Control Registers; PCIe PHY; UOE / Parse Control; D2D Rx FSM; D2D Rx Queue + Frame Header; D2D Tx Queue + Frame Header; D2D Tx FSM

D2D Physical Configuration
[Photo labels]
 System Under Test (SUT)
 Shared KVM
 System Output Display (SOD)
 1000 Watt SUT power supply
 DMM measuring CPU+CPU VRM 12V current
 Stimulus System (SS)
 FPGA programmer

SUT Configuration
[Photo labels]
 SSD Linux boot drive
 PCIe x8 bridge
 Fan to add cooling to FPGA fans
 Emulated SSD NetFPGA
 D2D NIC NetFPGA
 Spliced power supply to CPU for ammeter
 Xilinx USB programmer for FPGA and ChipScope
 Intel 2500 4-core 3.1GHz CPU
 2GB 1333MHz DDR3

D2D Latency (1500 byte packet)

D2D Power and Utilization benefit

D2D Measured Throughput (and Limitations)

Conclusions
 CPU-based descriptor DMA makes sense for offloading slow I/O devices, when the additional overhead is small relative to overall latency, power, and throughput.
 This work proposes small additional changes in hardware and software that bypass this descriptor overhead.
 Depending on the application's I/O transaction profile, the benefits in latency, throughput, and power can be significant.

Non-Transparent Bridging (NTB)
[Figure: Host A and Host B, each with two devices behind a PCIe switch; BAR Translate blocks on each side; arrows labeled "Host A memory write to host B" and "Host B memory write to host A"]
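
The BAR translation step in the figure can be pictured as a window mapping: a write that lands inside a local BAR window is forwarded to the same offset within a region of the other host's memory. The sketch below is a generic illustration of that idea; the function name and all addresses are hypothetical and do not describe a specific NTB device.

```python
def ntb_translate(addr, bar_base, bar_size, remote_base):
    """Map a local address inside a BAR window to the remote host's address.

    addr        -- address written on the local host
    bar_base    -- start of the local BAR window (hypothetical value)
    bar_size    -- size of the window in bytes
    remote_base -- start of the target region in the other host's memory
    """
    offset = addr - bar_base
    if not 0 <= offset < bar_size:
        raise ValueError("address is outside the BAR window")
    return remote_base + offset


# Host A writes at offset 0x100 into its window; Host B sees it at 0x2000_0100.
print(hex(ntb_translate(0x9000_0100, 0x9000_0000, 0x1_0000, 0x2000_0000)))
```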

Basic video frame buffer format
 4 bytes defined per pixel.
 Frame buffer mapped to linear system memory space
 FPGA writes in bit-level compatible format
 Verified with PCIe trace analyzer
[Figure: video screen with target stream space (e.g., 640x480); pixel information is 4B per pixel, bytes [0x0], [0x1], [0x2], and [0x3] (transparency)]
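
With 4 bytes per pixel in a linear memory space, the byte offset of any pixel follows from its coordinates. A minimal sketch, assuming row-major order with no padding between rows (the slide does not state the row stride); the function names are hypothetical.

```python
BYTES_PER_PIXEL = 4  # from the slide: 4 bytes defined per pixel


def pixel_offset(x, y, width=640, height=480):
    """Byte offset of pixel (x, y) from the start of the frame buffer.

    Assumes row-major layout with no padding at the end of each row.
    """
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("pixel is outside the frame")
    return (y * width + x) * BYTES_PER_PIXEL


def frame_bytes(width=640, height=480):
    """Total size of one frame in bytes."""
    return width * height * BYTES_PER_PIXEL


print(frame_bytes())        # 1228800
print(pixel_offset(0, 1))   # 2560
```

Under these assumptions a 640x480 frame occupies 1,228,800 bytes, and each new row starts 2,560 bytes after the previous one.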

D2D video stream timeline
[Sequence diagram over time across SOD, SUT, and SS]
 VoD server configuration
 SOD configures D2D stream configuration in SUT
 SOD requests UDP video stream on specific UDP port from SS (emulated SSD)
 SS begins streaming video to SUT
 SUT UOE strips packet header and passes to D2D TX queue
 SUT UOE frames new packet to the SOD
 SOD decodes video frames and displays
 Repeated pipelined VoD packets
 End of video stream, or SOD termination.
