5. Why such complexity?
Programmed I/O (PIO)
Direct CPU interaction with I/O devices
Decades old method
Extremely slow relative to CPU frequency
CPU DMA engine
“DMA channels”
Limited by I/O device performance
I/O devices need to be compatible with CPU DMA
engine
These methods are seen in embedded devices, but
not in mainstream general purpose CPUs.
System security aspect
PCI and PCIe allow I/O devices to access memory
13. Hot Potato
After surveying I/O transactions and acceleration
functions, we chose to keep the focus on
descriptors.
Treat the payload data as a Hot Potato
24. Legacy Video Streaming from Storage
[Figure labels: CPU; System Memory (DDR3); Kernel buffer for disk data; Kernel buffer for network packet data; Sender Device (e.g., SSD) with Disk Write Queues and Disk Read Queues; Receiver Device (e.g., NIC) with TX Queues and RX Queues; steps numbered 1, 2, 3.]
25. Details of a D2D NIC
[Figure: block diagram of a D2D-enabled NIC. Labels: Transmit Packet Queue(s); TX PHY; Receive Packet Queue(s); RX PHY; Legacy Rx DMA; UOE / Packet-Based Priority Control; PCIe BARx Memory Space; PCIe TLP/DLP/LLP processing; Legacy NIC Control Registers; D2D UOE Control Registers; Legacy Tx DMA; UOE / Parse Control; D2D Rx FSM; D2D Rx Queue + Frame Header; D2D Tx Queue + Frame Header; D2D Tx FSM; Optional; D2D Flow Control Registers; PCIe PHY.]
26. Details of a D2D SSD
[Figure: block diagram of a D2D-enabled SSD. Labels: Legacy Tx DMA; PCIe TLP/DLP/LLP processing; Legacy Rx DMA; Optional; SSD; SSD Controller and Buffers; Legacy NIC Control Registers; D2D Flow Control Registers; PCIe BARx Memory Space; D2D Tx FSM; D2D Rx FSM; PCIe PHY; D2D Tx Queue; D2D Rx Queue; Modified Rx DMA; Modified Tx DMA.]
27. D2D Transmit FSM
INIT state: Sets Tx Flow Control registers
◦ Tx Address register
◦ D2D Transmit Byte Count register
◦ Data Rate and Granularity parameters
◦ Tx and Rx Base Credits
Parse state:
◦ Map OS block addresses to SSD
physical addresses
◦ Enqueue in D2D Tx Queue
Send state:
◦ Fetch and forward SSD block data to
PCIe interface.
Wait state:
◦ Waits until the next chunk needs to be
sent.
◦ Depends on Data Rate and Granularity
Check state:
◦ Checks whether bytes sent < D2D
Transmit Byte Count
[State diagram: Idle, Init, Parse, Send, Wait, Check; the Check state has True and False branches.]
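The transmit states listed on this slide can be walked through in software. The following is a minimal Python sketch, not the hardware implementation. The `TxFlowControl` fields and the `read_block` callback are hypothetical stand-ins for the Tx Flow Control registers and the SSD fetch. It assumes that a true Check result loops back to Parse and a false result returns to Idle, that the byte count is positive, and it does not model the Data Rate pacing of the Wait state.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    INIT = auto()
    PARSE = auto()
    SEND = auto()
    WAIT = auto()
    CHECK = auto()


@dataclass
class TxFlowControl:
    # Hypothetical stand-ins for the registers set in the INIT state.
    tx_address: int   # Tx Address register (destination base address)
    byte_count: int   # D2D Transmit Byte Count register (assumed > 0)
    granularity: int  # bytes sent per chunk


def run_tx_fsm(regs, read_block):
    """Step through the Tx states until byte_count bytes have been sent.

    read_block(offset, length) is a hypothetical callback that returns
    the SSD data for one chunk. Returns (writes, trace), where writes is
    a list of (destination address, payload) pairs.
    """
    state, sent, queue, writes, trace = State.IDLE, 0, [], [], []
    while True:
        trace.append(state)
        if state is State.IDLE:
            state = State.INIT
        elif state is State.INIT:
            sent = 0  # flow control registers are considered loaded here
            state = State.PARSE
        elif state is State.PARSE:
            length = min(regs.granularity, regs.byte_count - sent)
            queue.append((sent, length))  # enqueue in the D2D Tx queue
            state = State.SEND
        elif state is State.SEND:
            offset, length = queue.pop(0)
            payload = read_block(offset, length)
            writes.append((regs.tx_address + offset, payload))
            sent += length
            state = State.WAIT
        elif state is State.WAIT:
            state = State.CHECK  # pacing by Data Rate is not modeled
        elif state is State.CHECK:
            if sent < regs.byte_count:  # True branch: more data to send
                state = State.PARSE
            else:                       # False branch: transfer complete
                trace.append(State.IDLE)
                return writes, trace
```

With a 10-byte transfer and a granularity of 4, the sketch issues three writes of 4, 4, and 2 bytes at consecutive destination addresses.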
28. D2D Receive FSM (with UOE)
INIT state: Set D2D Flow & UOE Control
registers
◦ MAC source (SSD) and destination (NIC)
addresses.
◦ Source and destination IP addresses.
◦ UDP source and destination port, length, and
checksum.
◦ RTP version, sequence #, and timestamp.
Fetch state:
◦ Monitor D2D Rx Queue for data.
Frame state:
◦ Assign static fields: MAC & IP addresses.
Calc state:
◦ Assign Ethernet length & CRC, IP length &
checksum, UDP length & checksum, RTP
timestamp & sequence #.
◦ Enqueue in Tx Packet Queue
Send state:
◦ Send to MAC layer for transmission
[State diagram: Idle, Init, Fetch, Frame, Calc, Send.]
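The Frame and Calc states amount to wrapping each payload in IP, UDP, and RTP headers and filling in the per-packet fields. The following is a minimal Python sketch of that framing, assuming IPv4 without options, a fixed 12-byte RTP header, and hypothetical defaults for TTL, SSRC, and payload type. The Ethernet header and CRC are left out, and none of this is taken from the UOE hardware itself.

```python
import socket
import struct


def inet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum used by IPv4 and UDP."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def frame_packet(payload, src_ip, dst_ip, src_port, dst_port,
                 seq, timestamp, ssrc=0, payload_type=96):
    """Return IPv4 + UDP + RTP headers followed by payload."""
    src, dst = socket.inet_aton(src_ip), socket.inet_aton(dst_ip)

    # RTP: version 2, no padding, no extension, no CSRC entries.
    rtp = struct.pack("!BBHII", 0x80, payload_type & 0x7F,
                      seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)

    # UDP: length and checksum are per-packet (Calc state) fields.
    udp_len = 8 + len(rtp) + len(payload)
    udp_zero = struct.pack("!HHHH", src_port, dst_port, udp_len, 0)
    pseudo = src + dst + struct.pack("!BBH", 0, 17, udp_len)
    # A computed checksum of 0 is sent as 0xFFFF in UDP.
    udp_csum = inet_checksum(pseudo + udp_zero + rtp + payload) or 0xFFFF
    udp = struct.pack("!HHHH", src_port, dst_port, udp_len, udp_csum)

    # IPv4: addresses are static (Frame state); length and checksum vary.
    ip_zero = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + udp_len,
                          0, 0, 64, 17, 0, src, dst)
    ip = ip_zero[:10] + struct.pack("!H", inet_checksum(ip_zero)) + ip_zero[12:]

    return ip + udp + rtp + payload
```

A receiver can validate the result by recomputing the checksum over the IP header, which yields 0 when the header is intact.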
29. NetFPGA Logical Architecture
Xilinx Virtex-5 TX240T FPGA, 10GbE, and memories
[Figure: layered block diagram. Labels: Software Interface Layer; nf0, nf1, nf2, nf3, ioctl; PCIe Interface Layer; Shared PCIe interface; DMA Engine Registers; AXI Lite; AMBA AXI-Stream Interface (160 MHz, 64-bit); NetFPGA Internal Control and Routing; Ethernet; four MAC TxQ / MAC RxQ pairs; D2D Tx/Rx Queues; D2D Tx/Rx FSMs; Legacy DMA RxQ; Legacy DMA TxQ; Control & Status Interface; Read/write from/to D2D control registers; Data Path Interface.]
32. D2D VoD streaming configuration
[Figure labels: System Under Test (SUT); Stimulus System (SS); System Output Display (SOD); CPU; System Memory; PCIe Bridge; PCIe Interface (three instances); In-bound device: Emulated SSD stream; Out-bound device: UDP VoD stream. Two D2D-enabled NIC block diagrams are shown, with the same blocks as in slide 25.]
Incoming network data is treated as storage data to be written to the D2D-enabled NIC
33. D2D Physical Configuration
[Photo labels: System Under Test (SUT); Shared KVM; System Output Display (SOD); 1000 Watt SUT power supply; DMM measuring CPU + CPU VRM 12V current; Stimulus System (SS); FPGA programmer.]
34. SUT Configuration
[Photo labels: SSD Linux boot drive; PCIe x8 bridge; Fan to add cooling to FPGA fans; Emulated SSD NetFPGA; Spliced power supply to CPU for ammeter; D2D NIC NetFPGA; Xilinx USB programmer for FPGA and Chipscope; Intel 2500 4-core 3.1GHz CPU; 2GB 1333MHz DDR3.]
38. Conclusions
CPU-based descriptor DMA made sense for
off-loading slow I/O devices, when the
additional overhead was small relative to overall
latency, power, and throughput.
This work proposes small additional changes in
hardware and software that bypass this descriptor
overhead.
Depending on the application's I/O transaction
profile, the benefits in latency, throughput, and power
are significant.
40. Non-Transparent Bridging (NTB)
[Figure labels: Host A; Host B; two PCIe switches, each with Device, Device, and paired BAR / Translate blocks; Host B memory write to host A; Host A memory write to host B.]
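The BAR and Translate blocks in this figure suggest a simple address re-basing: a write that lands inside a BAR window on one host is forwarded to the matching offset in the other host's memory. The following is a minimal Python sketch of that arithmetic only. The `BarWindow` class, its field names, and the example addresses are hypothetical and are not taken from any particular NTB device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BarWindow:
    bar_base: int        # start of the BAR window in the local address space
    size: int            # window size in bytes
    translate_base: int  # start of the mapped region in the peer's memory

    def translate(self, local_addr: int) -> int:
        """Map an address inside the BAR window to the peer's address."""
        offset = local_addr - self.bar_base
        if not 0 <= offset < self.size:
            raise ValueError("address is outside the BAR window")
        return self.translate_base + offset


# Example: host A writes through a 1 MiB window that maps into host B memory.
window = BarWindow(bar_base=0xD0000000, size=0x100000,
                   translate_base=0x80000000)
peer_addr = window.translate(0xD0000010)
```

Here `peer_addr` is `0x80000010`, the same offset into the region on the peer side.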
41. Basic video frame buffer format
4 bytes defined per pixel.
Frame buffer mapped to linear system memory space
FPGA writes in bit-level compatible format
Verified with PCIe trace analyzer
[Figure labels: Video screen; Target stream space (e.g., 640x480); Pixel information (4B per pixel): [0x0], [0x1], [0x2], [0x3] transparency; pixel.]
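With 4 bytes per pixel in a linear buffer, the byte offset of a pixel follows directly from its coordinates. The following is a minimal Python sketch, assuming the 640x480 target stream space starts at offset 0 and that rows are packed with no padding. Neither assumption is stated on the slide, and the meaning of bytes 0 to 2 is left open here; only byte 3 is labeled as transparency.

```python
BYTES_PER_PIXEL = 4


def pixel_offset(x: int, y: int, width: int = 640, height: int = 480) -> int:
    """Byte offset of pixel (x, y) in a linear, row-major frame buffer."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("pixel is outside the target stream space")
    return (y * width + x) * BYTES_PER_PIXEL


def write_pixel(frame_buffer: bytearray, x: int, y: int, pixel: bytes,
                width: int = 640, height: int = 480) -> None:
    """Store one 4-byte pixel; byte 3 is the transparency byte."""
    if len(pixel) != BYTES_PER_PIXEL:
        raise ValueError("a pixel is exactly 4 bytes")
    start = pixel_offset(x, y, width, height)
    frame_buffer[start:start + BYTES_PER_PIXEL] = pixel


# Example: a 640x480 buffer occupies 640 * 480 * 4 = 1,228,800 bytes.
frame_buffer = bytearray(640 * 480 * BYTES_PER_PIXEL)
write_pixel(frame_buffer, 1, 0, b"\x11\x22\x33\xff")
```

The second pixel of the first row lands at bytes 4 to 7, and the first pixel of the second row starts at byte 2560.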
42. D2D video stream timeline
[Figure: timeline with a Time axis and columns SOD, SUT, SS.]
VoD server configuration
SOD sets up the D2D stream configuration in the SUT
SOD requests a UDP video stream on a specific UDP port from the SS (emulated SSD)
SS begins streaming video to the SUT
SUT UOE strips the packet header and passes the payload to the D2D TX queue
SUT UOE frames a new packet to the SOD
SOD decodes video frames and displays them
Repeated pipelined VoD packets
End of video stream, or SOD termination.
Editor's notes
Emphasize data movement
PCIe prevalence
AKA UPI or KTI