Steen Larsen
March 5, 2015
Offloading of I/O Transactions in Current CPU Architectures
Agenda Outline
• Introduction and motivation
• Background
• iDMA
• Hot Potato
• Device2Device
• Conclusions
Growing I/O System Performance Discrepancy
Legacy I/O Transmit Operation
Why such complexity?
• Programmed I/O (PIO)
• Direct CPU interaction with I/O devices
• Decades-old method
• Extremely slow relative to CPU frequency
• CPU DMA engine
• “DMA channels”
• Limited by I/O device performance
• I/O devices need to be compatible with the CPU DMA engine
• These methods are seen in embedded devices, but not in mainstream general-purpose CPUs.
• System security aspect
• PCI and PCIe allow I/O devices to access memory
Background
• Direct integration
• Supercomputing I/O forwarding
• RDMA (Future I/O and NGIO)
Direct NIC Integration
• Sun Niagara CPU [8 cores, 64 threads]
• Dual 10GbE on-die
• Released in 2007
Supercomputer I/O Processors with I/O Transaction Forwarding
• This approach leads to NGIO + Future I/O => InfiniBand & iWARP RDMA
• Connection context offloading (similar to TOE architectures)
• “30 minutes to print `Hello world`”
• http://www.mcs.anl.gov/papers/P1594A.pdf
iDMA
iDMA Transaction Operations
[Figure panels: iDMA transmit; iDMA receive]
iDMA Latency and Throughput Benefits
Total critical path latency = TxSW + TxHW + fiber + RxHW + RxSW
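As a rough illustration of the critical-path sum above, the sketch below adds the five components and reports each one's share of the total. The function names are hypothetical, and any numbers passed in are placeholders for illustration, not measurements from this work.

```python
def critical_path_latency(tx_sw, tx_hw, fiber, rx_hw, rx_sw):
    """Total critical path latency = TxSW + TxHW + fiber + RxHW + RxSW."""
    return tx_sw + tx_hw + fiber + rx_hw + rx_sw


def latency_shares(tx_sw, tx_hw, fiber, rx_hw, rx_sw):
    """Return the fraction of the total contributed by each component."""
    parts = {
        "TxSW": tx_sw,
        "TxHW": tx_hw,
        "fiber": fiber,
        "RxHW": rx_hw,
        "RxSW": rx_sw,
    }
    total = critical_path_latency(tx_sw, tx_hw, fiber, rx_hw, rx_sw)
    if total == 0:
        raise ValueError("total latency must be non-zero")
    return {name: value / total for name, value in parts.items()}
```

With placeholder inputs `latency_shares(1, 2, 3, 4, 5)` (any consistent time unit), RxSW accounts for one third of the total. The point of the breakdown is to show which term an offload has to shrink to matter.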
iDMA Summary
Hot Potato
• After surveying I/O transactions and acceleration functions, we chose to keep the focus on descriptors.
• Treat the payload data as a Hot Potato
Hot Potato Motivation: Legacy NIC Internal Design
[Figure panels: Transmit I/O; Receive I/O]
Write-Combining Buffers
[Figure: CPU core with 64B write-combining (WC) buffers (six shown, each marked full) feeding PCIe packets with a 24B header and a 64B payload. Source: Myricom experiments]
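Taking the numbers on this slide at face value (a 24B PCIe header for every 64B write-combining payload), link efficiency for WC-sized writes can be estimated as payload / (payload + header). The function names below are hypothetical, and the estimate ignores any other protocol overhead.

```python
def payload_efficiency(payload_bytes=64, header_bytes=24):
    """Fraction of link bytes that carry payload for one PCIe packet."""
    return payload_bytes / (payload_bytes + header_bytes)


def wire_bytes(message_bytes, payload_bytes=64, header_bytes=24):
    """Bytes on the link when a message is split into payload-sized packets."""
    packets = -(-message_bytes // payload_bytes)  # ceiling division
    return message_bytes + packets * header_bytes
```

With the slide's values, `payload_efficiency()` is 64/88, roughly 73%, and `wire_bytes(64)` is 88.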
Hot Potato Device Design
Hot Potato Prototype
PCIe Packet Framing
To discuss further we need to dig into PCIe protocol details:
Typical ICMP Ping Sequence
[Figure: trace annotated with doorbell write, descriptor read, and ICMP packet; IP address “C0A80001”]
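The hex string “C0A80001” on this slide is a 32-bit IPv4 address written as raw bytes, presumably as it shows up in the trace. A minimal sketch of decoding it with Python's standard `ipaddress` module (the helper name is hypothetical):

```python
import ipaddress


def hex_to_ipv4(hex_word: str) -> str:
    """Convert an 8-digit hex word, big-endian, to dotted-quad notation."""
    return str(ipaddress.IPv4Address(int(hex_word, 16)))
```

`hex_to_ipv4("C0A80001")` returns `"192.168.0.1"`.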
Example Hot-Potato Loopback
Hot-Potato Latency and Throughput Benefit
Measurements and Conclusions
• 1.5 µs latency reduction in benchmark tests
• 8% latency reduction in a real memcached application
Device2Device (D2D)
• Shift gears from CPU I/O transactions to inter-device communication.
Legacy Video Streaming from Storage
[Figure: data path in three numbered steps (1, 2, 3) among the sender device (e.g., SSD) with disk write and disk read queues, system memory (DDR3) holding a kernel buffer for disk data and a kernel buffer for network packet data, the CPU, and the receiver device (e.g., NIC) with TX and RX queues]
Details of a D2D NIC
[Figure: block diagram of a D2D-enabled NIC. Labels: Transmit Packet Queue(s); TX PHY; Receive Packet Queue(s); RX PHY; Legacy Rx DMA; Legacy Tx DMA; UOE / Packet-Based Priority Control; UOE / Parse Control; PCIe BARx Memory Space; PCIe TLP/DLP/LLP processing; Legacy NIC Control Registers; D2D UOE Control Registers; Optional D2D Flow Control Registers; D2D Rx FSM; D2D Rx Queue + Frame Header; D2D Tx Queue + Frame Header; D2D Tx FSM; PCIe PHY]
Details of a D2D SSD
[Figure: block diagram of a D2D-enabled SSD. Labels: Legacy Tx DMA; Legacy Rx DMA; Modified Tx DMA; Modified Rx DMA; PCIe TLP/DLP/LLP processing; Optional SSD; SSD Controller and Buffers; Legacy NIC Control Registers; D2D Flow Control Registers; PCIe BARx Memory Space; D2D Tx FSM; D2D Rx FSM; D2D Tx Queue; D2D Rx Queue; PCIe PHY]
D2D Transmit FSM
• INIT state: Sets Tx Flow Control registers
◦ Tx Address register
◦ D2D Transmit Byte Count register
◦ Data Rate and Granularity parameters
◦ Tx and Rx Base Credits
• Parse state:
◦ Map OS block addresses to SSD physical addresses
◦ Enqueue in D2D Tx Queue
• Send state:
◦ Fetch and forward SSD block data to PCIe interface.
• Wait state:
◦ Waits until the next chunk needs to be sent.
◦ Depends on Data Rate and Granularity
• Check state:
◦ Checks whether bytes sent < D2D Transmit Byte Count
[Figure: state diagram with states Idle, Init, Parse, Send, Wait, and Check; the Check state has True and False branches]
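The transmit states on this slide can be sketched as a small software model. This is an illustration only: the transition order (Init, Parse, Send, Wait, Check), the choice that a True check loops back to Send while False returns to Idle, and the names `TxState` and `run_d2d_tx` are all assumptions, since the arrows of the state diagram are not recoverable from the slide text. Rate pacing in the Wait state and address mapping in the Parse state are omitted.

```python
from enum import Enum, auto


class TxState(Enum):
    IDLE = auto()
    INIT = auto()
    PARSE = auto()
    SEND = auto()
    WAIT = auto()
    CHECK = auto()


def run_d2d_tx(total_bytes, chunk_bytes):
    """Walk the assumed FSM once; return (visited states, bytes sent).

    total_bytes stands in for the D2D Transmit Byte Count register and
    chunk_bytes for the Granularity parameter (both hypothetical mappings).
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    trace = [TxState.IDLE]
    state = TxState.INIT
    sent = 0
    while state is not TxState.IDLE:
        trace.append(state)
        if state is TxState.INIT:
            state = TxState.PARSE      # flow control registers are set here
        elif state is TxState.PARSE:
            state = TxState.SEND       # block addresses mapped and enqueued
        elif state is TxState.SEND:
            sent += min(chunk_bytes, total_bytes - sent)
            state = TxState.WAIT
        elif state is TxState.WAIT:
            state = TxState.CHECK      # a real device would pace by data rate
        elif state is TxState.CHECK:
            # True: more bytes remain, send again. False: transfer is done.
            state = TxState.SEND if sent < total_bytes else TxState.IDLE
    trace.append(TxState.IDLE)
    return trace, sent
```

For example, `run_d2d_tx(4096, 1500)` passes through the Send state three times before returning to Idle.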
D2D Receive FSM (with UOE)
• INIT state: Set D2D Flow & UOE Control registers
◦ MAC source (SSD) and destination (NIC) addresses.
◦ Source and destination IP addresses.
◦ UDP source and destination port, length, and checksum.
◦ RTP version, sequence #, and timestamp.
• Fetch state:
◦ Monitor D2D Rx Queue for data.
• Frame state:
◦ Assign static fields: MAC & IP addresses.
• Calc state:
◦ Assign Ethernet length & CRC, IP length & checksum, UDP length & checksum, RTP timestamp & sequence #.
◦ Enqueue in Tx Packet Queue
• Send state:
◦ Send to MAC layer for transmission
[Figure: state diagram with states Idle, Init, Fetch, Frame, Calc, and Send]
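The Frame and Calc states on this slide amount to wrapping queued payload data in packet headers and filling in the computed fields. The sketch below is a software approximation under stated assumptions: it builds IPv4, UDP, and RTP headers only, leaves out the Ethernet header and CRC, sets the UDP checksum to 0 (which IPv4 permits to mean "not computed"), and uses placeholder TTL, RTP payload type, and SSRC values. The function names are hypothetical, and nothing here is claimed to match the hardware UOE.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 one's-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def frame_packet(payload, src_ip, dst_ip, src_port, dst_port, seq, timestamp):
    """Return IPv4 + UDP + RTP headers followed by the payload."""
    # RTP: version 2 (0x80); payload type 96 and SSRC 0 are placeholders.
    rtp = struct.pack("!BBHII", 0x80, 96, seq & 0xFFFF,
                      timestamp & 0xFFFFFFFF, 0)
    udp_len = 8 + len(rtp) + len(payload)
    # UDP checksum left at 0, i.e. not computed in this sketch.
    udp = struct.pack("!HHHH", src_port, dst_port, udp_len, 0)
    ip_len = 20 + udp_len
    # IPv4 header with checksum field zeroed; TTL 64 is a placeholder.
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, ip_len, 0, 0, 64, 17, 0,
                     socket.inet_aton(src_ip), socket.inet_aton(dst_ip))
    csum = internet_checksum(ip)
    ip = ip[:10] + struct.pack("!H", csum) + ip[12:]
    return ip + udp + rtp + payload
```

A quick sanity check is that running `internet_checksum` over the finished 20-byte IPv4 header returns 0, which is how a receiver validates it.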
NetFPGA Logical Architecture
[Figure: NetFPGA logical architecture on a Xilinx Virtex-5 TX240T FPGA with 10GbE and memories. Layer labels: Software Interface Layer; PCIe Interface Layer; NetFPGA Internal Control and Routing; Ethernet. Block labels: nf0, nf1, nf2, nf3, ioctl; Shared PCIe interface; DMA Engine Registers; AXI Lite; AMBA AXI-Stream Interface (160 MHz, 64-bit); Legacy DMA RxQ; Legacy DMA TxQ; D2D Tx/Rx Queues; D2D Tx/Rx FSMs; read/write from/to D2D control registers; four MAC TxQ / MAC RxQ pairs. Legend: Control & Status Interface; Data Path Interface]
Sample Chipscope trace
NIC-to-Video
D2D VoD streaming configuration
[Figure: test configuration. Labels: System Output Display (SOD); out-bound device: UDP VoD stream; Stimulus System (SS); in-bound device: emulated SSD stream; System Under Test (SUT); CPU; System Memory; PCIe Bridge; PCIe Interface (three shown). Annotation: incoming network data is treated as storage data to be written to the D2D-enabled NIC. Two D2D-enabled NIC block diagrams are shown, each with Transmit and Receive Packet Queue(s), TX PHY, RX PHY, Legacy RX DMA, Legacy TX DMA, UOE / Packet-Based Priority Control, UOE / Parse Control, PCIe BARx Memory Space, PCIe TLP/DLP/LLP processing, Legacy NIC Control Registers, D2D Flow Control Registers, D2D UOE Control Registers, D2D Rx FSM, D2D Rx Queue + Frame Header, D2D Tx Queue + Frame Header, D2D Tx FSM, and PCIe PHY]
D2D Physical Configuration
[Figure: lab setup. Labels: System Under Test (SUT); shared KVM; System Output Display (SOD); 1000 Watt SUT power supply; DMM measuring CPU+CPU VRM 12V current; Stimulus System (SS); FPGA programmer]
SUT Configuration
[Figure: SUT internals. Labels: SSD Linux boot drive; PCIe x8 bridge; fan to add cooling to FPGA fans; emulated SSD NetFPGA; spliced power supply to CPU for ammeter; D2D NIC NetFPGA; Xilinx USB programmer for FPGA and Chipscope; Intel 2500 4-core 3.1GHz CPU; 2GB 1333MHz DDR3]
D2D Latency (1500 byte packet)
D2D Power and Utilization benefit
D2D Measured Throughput (and Limitations)
Conclusions
• CPU-based descriptor DMA makes sense for offloading slow I/O devices, where the additional overhead is small relative to overall latency, power, and throughput.
• This work proposes small additional changes in hardware and software that bypass this descriptor overhead.
• Depending on the application's I/O transaction profile, the benefits in latency, throughput, and power can be significant.
BACKUP
Non-Transparent Bridging (NTB)
[Figure: Host A and Host B, each with a PCIe switch and attached devices, linked through BAR translation blocks. Arrows show a Host A memory write to Host B and a Host B memory write to Host A]
Basic video frame buffer format
• 4 bytes defined per pixel.
• Frame buffer mapped to linear system memory space
• FPGA writes in a bit-level compatible format
• Verified with PCIe trace analyzer
[Figure: video screen with the target stream space (e.g., 640x480) and the pixel information layout, 4B per pixel: bytes [0x0], [0x1], [0x2], and [0x3] (transparency)]
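Assuming the linear mapping is row-major with no padding between rows (the slide does not say), the byte offset of a pixel follows directly from the 4-bytes-per-pixel format. The function names below are hypothetical.

```python
BYTES_PER_PIXEL = 4  # from the slide: 4 bytes defined per pixel


def frame_bytes(width, height):
    """Size of one frame in bytes."""
    return width * height * BYTES_PER_PIXEL


def pixel_offset(x, y, width):
    """Byte offset of pixel (x, y) from the start of the frame buffer.

    Assumes row-major order with no stride padding.
    """
    if not 0 <= x < width or y < 0:
        raise ValueError("pixel coordinates out of range")
    return (y * width + x) * BYTES_PER_PIXEL
```

For the 640x480 stream space, `frame_bytes(640, 480)` is 1,228,800 bytes and the last pixel sits at `pixel_offset(639, 479, 640)`, which is 1,228,796.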
D2D video stream timeline
[Figure: timeline across SOD, SUT, and SS, with time running downward]
• VoD server configuration
• SOD configures D2D stream configuration in SUT
• SOD requests UDP video stream on specific UDP port from SS (emulated SSD)
• SS begins streaming video to SUT
• SUT UOE strips packet header and passes to D2D TX queue
• SUT UOE frames new packet to the SOD
• SOD decodes video frames and displays
• Repeated pipelined VoD packets
• End of video stream, or SOD termination.


Editor's notes

  1. Emphasize data movement PCIe prevalence AKA UPI or KTI
  2. http://www.verien.com/picts/pcie_packet_encapsulation.png