PFQ: a Novel Architecture for Packet Capture on Parallel Commodity Hardware
Nicola Bonelli, Andrea Di Pietro,
Stefano Giordano, Gregorio Procissi
CNIT e Dip. di Ingegneria dell’Informazione - Università di Pisa
Outline
• Introduction and motivation
• Multi-core programming guidelines
• PFQ architecture
• Performance evaluation
• Conclusion and future work
Introduction and Motivations
• Designing monitoring applications has become a very challenging task:
– The hardware has evolved: 10 Gbit/s links, multi-core architectures and multi-queue network devices (MSI-X)…
• Current traffic-monitoring software, including some parts of the Linux kernel, is not optimized for the new hardware:
– (+) Kernel support for multi-queue network adapters is implemented
– (-) The Linux kernel has very poor support for monitoring applications
– (-) PF_PACKET is extremely slow, even when used in memory-mapped mode (pcap)
– (-) PF_RING was designed for single-processor systems
• Traffic monitoring should:
– Exploit modern hardware, scaling as close to linearly as possible with the number of cores
– Decouple the hardware parallelism from the software parallelism
– Use a divide-and-conquer approach to steer packets to applications or threads
Multi-thread on Multi-core
• What’s wrong with the current software?
– Multi-threading paradigms designed for single-processor systems are still valid, but they prevent the software from scaling with the number of cores.
• For software to be effective on a multi-core system…
– Semaphores, mutexes, and spinlocks are out of the question!
– R/W mutexes prevent readers from scaling, even though they are supposed to grant concurrent access to readers
– Atomic operations are sometimes required, but must be used in moderation
• use sparse counters instead of atomic ones
• design algorithms so that the cost of atomic operations is amortized
– Sharing (writes to shared data) has a serious impact on performance
– Writes to shared memory are delayed by the hardware, and reads must be synchronized
– False sharing must, and always can, be avoided
• Wait-free algorithms are mandatory; lock-free algorithms should be avoided (if possible)…
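The "sparse counters" guideline can be illustrated with a minimal sketch. It is not PFQ code: the class name, the 64-byte cache-line size and the fixed number of writers are assumptions made for the example. Each writer owns a private, cache-line-aligned slot, so updates need no atomic read-modify-write and never share a line with another writer.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size: 64 bytes is typical on x86-64, but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// One slot per writer, aligned and padded to a full cache line so that two
// writers never touch the same line (this avoids false sharing).
struct alignas(kCacheLine) CounterSlot {
    std::atomic<std::uint64_t> value{0};
};

// Hypothetical sparse counter: writer `id` is the only thread that ever
// writes slot `id`; any thread may read the sum.
template <std::size_t Writers>
class SparseCounter {
public:
    // Relaxed load + store on a private slot: no locked instruction is needed
    // because no other thread writes this slot.
    void add(std::size_t id, std::uint64_t n = 1) {
        assert(id < Writers);
        std::atomic<std::uint64_t> &v = slots_[id].value;
        v.store(v.load(std::memory_order_relaxed) + n, std::memory_order_relaxed);
    }

    // The total is a snapshot: it may lag behind concurrent writers slightly.
    std::uint64_t read() const {
        std::uint64_t total = 0;
        for (const CounterSlot &s : slots_)
            total += s.value.load(std::memory_order_relaxed);
        return total;
    }

private:
    std::array<CounterSlot, Writers> slots_;
};
```

The trade-off is memory (one cache line per writer) and a slightly stale total, in exchange for writes that cause no contention between cores.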
PFQ preamble
• PFQ is a novel capture system with native support for 64-bit multi-core architectures, designed according to all the guidelines presented above
• PFQ is not a custom driver
• It is an architecture that runs on top of standard Ethernet drivers, as well as slightly modified ones, the “PFQ-aware drivers” (an idea inherited from PF_RING-aware drivers)
• PFQ enables packet capture, filtering, aggregation of hardware queues and devices, packet classification, packet steering and so forth…
• It decouples the hardware parallelism (e.g. Intel RSS) from the software parallelism
PFQ architecture
Built on top of the following components…
• User-space C++11 library that provides the same abstractions as the STL: containers and iterators
• DB-MPSC queue: double-buffered multiple-producer, single-consumer queue (for communication with user space):
– Allows NAPI contexts to enqueue packets concurrently
– Reduces sharing and eliminates false sharing between user space and NAPI contexts
– Enables user space to copy packets from the queue to a private buffer in batches
• De-multiplexing matrix:
– perfectly wait-free, concurrently accessible data structure
– no serialization is required to steer or copy packets
• SPSC queue:
– enables batching of socket buffers (skb) to increase temporal locality for the memory manager (SLAB for kernels prior to 2.6.39)
• Driver awareness:
– an effective idea inherited from PF_RING
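To make the SPSC component concrete, here is a minimal sketch of a bounded single-producer/single-consumer ring with a batched dequeue. It is a generic user-space illustration, not the kernel code that batches skbs; the class name, the power-of-two capacity and the 64-byte alignment are assumptions for the example.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical bounded SPSC ring. Exactly one thread may call push() and
// exactly one thread may call pop_batch(); neither ever blocks or retries.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity != 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side: returns false when the ring is full.
    bool push(const T &item) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;
        ring_[head & (Capacity - 1)] = item;
        head_.store(head + 1, std::memory_order_release);  // publish the item
        return true;
    }

    // Consumer side: copies up to `max` items to `out` and publishes the new
    // tail once per batch instead of once per item.
    template <typename OutputIt>
    std::size_t pop_batch(OutputIt out, std::size_t max) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        std::size_t n = head - tail;
        if (n > max) n = max;
        for (std::size_t i = 0; i < n; ++i)
            *out++ = ring_[(tail + i) & (Capacity - 1)];
        tail_.store(tail + n, std::memory_order_release);  // free the slots
        return n;
    }

private:
    std::array<T, Capacity> ring_{};
    // Producer and consumer indices live on separate (assumed 64-byte) lines.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

Handling items in batches means the shared index is written once per batch, which is the same idea the slide applies to socket buffers.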
PFQ architecture
Packet steering
Given a packet and a set of sockets, which sockets need to receive it?
• In capture engines that do not support steering, filtering can be used to dispatch packets across a number of sockets:
– Traversing the socket list to find the sockets interested in a packet has linear complexity, O(n)
– The approach is flexible because it enables dispatching as well as copies
• We designed a “packet steering” paradigm that:
– Identifies the destination sockets with O(1) complexity
– Supports both balancing and copying of packets
– Allows custom hash functions for packet dispatching
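The O(1) lookup with a pluggable hash can be sketched as follows. This is an illustration under stated assumptions rather than PFQ's API: `Packet`, `BalancingGroup`, `symmetric_flow_hash` and the integer socket identifiers are all hypothetical names. The destination is found with one hash and one array index, independent of the number of sockets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, already-parsed packet header fields.
struct Packet {
    std::uint32_t src_ip, dst_ip;
    std::uint16_t src_port, dst_port;
};

// A steering function maps a packet to a 32-bit hash value.
using SteeringHash = std::function<std::uint32_t(const Packet &)>;

// Example custom hash: symmetric, so both directions of a flow get the same
// value and therefore reach the same socket.
inline std::uint32_t symmetric_flow_hash(const Packet &p) {
    const std::uint32_t ips = p.src_ip ^ p.dst_ip;
    const std::uint32_t ports = static_cast<std::uint32_t>(p.src_port ^ p.dst_port);
    return (ips * 2654435761u) ^ ports;
}

// Hypothetical balancing group: a fixed set of sockets plus a steering hash.
class BalancingGroup {
public:
    BalancingGroup(std::vector<int> sockets, SteeringHash hash)
        : sockets_(std::move(sockets)), hash_(std::move(hash)) {
        assert(!sockets_.empty());
    }

    // O(1): one hash computation and one index, no traversal of the sockets.
    int steer(const Packet &p) const {
        return sockets_[hash_(p) % sockets_.size()];
    }

private:
    std::vector<int> sockets_;
    SteeringHash hash_;
};
```

Swapping `symmetric_flow_hash` for another function changes the balancing policy without touching the lookup.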
Packet steering
• Completely concurrent block (wait-free):
– The shared state (de-multiplexing matrix) is mostly read-only
– Writes, which are generally rare events, are serialized with respect to each other to prevent race conditions. Each update of the state in the matrix is atomic
• Load-balancing groups:
– A socket can create or subscribe to a load-balancing group
– It will then receive a fraction of the overall traffic
• Socket binding: a socket can be bound to
– One or more hardware queues of a given NIC
– One or more NICs
• Binding and balancing groups are orthogonal and can be used concurrently
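One way to picture a mostly-read-only de-multiplexing matrix is a table of atomic bitmasks indexed by device and hardware queue. The sketch below is an assumption about the general shape, not PFQ's actual layout: the class name, the 64-socket limit and the use of a mutex on the control path are all choices made for the example. Readers perform a single atomic load, while the rare writers serialize among themselves and publish each change with one atomic store.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical matrix: cell (dev, queue) holds a bitmask in which bit i is
// set when socket i must receive packets from that device and queue.
template <std::size_t Devices, std::size_t Queues>
class DemuxMatrix {
public:
    // Data path (one call per packet): a single atomic load, never blocks.
    std::uint64_t sockets_for(std::size_t dev, std::size_t queue) const {
        return cells_[dev][queue].load(std::memory_order_acquire);
    }

    // Control path (rare): writers are serialized with each other, and the
    // new mask becomes visible to readers in one atomic store.
    void bind(unsigned socket_id, std::size_t dev, std::size_t queue) {
        assert(socket_id < 64);
        std::lock_guard<std::mutex> guard(writers_);
        std::atomic<std::uint64_t> &cell = cells_[dev][queue];
        cell.store(cell.load(std::memory_order_relaxed) | (std::uint64_t{1} << socket_id),
                   std::memory_order_release);
    }

    void unbind(unsigned socket_id, std::size_t dev, std::size_t queue) {
        assert(socket_id < 64);
        std::lock_guard<std::mutex> guard(writers_);
        std::atomic<std::uint64_t> &cell = cells_[dev][queue];
        cell.store(cell.load(std::memory_order_relaxed) & ~(std::uint64_t{1} << socket_id),
                   std::memory_order_release);
    }

private:
    std::array<std::array<std::atomic<std::uint64_t>, Queues>, Devices> cells_{};
    std::mutex writers_;  // taken only by bind/unbind, never on the data path
};
```

A reader always observes either the old mask or the new one, never a partial update, which is what allows the data path to proceed without any serialization.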
Socket queue: DB-MPSC
• The socket queue is an unavoidable contention point:
– Load balancing shuffles packets across sockets
• How can contention be handled without impacting performance?
– Use an atomic operation to reserve a slot within the queue (will be amortized in future implementations)
– Reduce cache-coherence traffic among the cores running the kernel threads and the user-space thread
– The swap between buffers is triggered by the user-space thread or by a watermark
– Packets can be copied in batches or consumed in place
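The slot-reservation and buffer-swap ideas can be sketched as below. This is a simplified illustration, not the PFQ implementation: the state encoding (top bit selects the active buffer, low bits count reservations), the per-slot `ready` flag and all names are assumptions. Only the consumer-triggered swap is shown; the watermark trigger is omitted, and the consumer spins briefly if a producer has reserved a slot but not yet filled it.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical double-buffered multi-producer, single-consumer queue.
template <typename T, std::size_t Capacity>
class DbMpscQueue {
    // Bit 63 of state_ selects the active buffer; the low bits count the
    // reservations made in it (the count may exceed Capacity on overflow).
    static constexpr std::uint64_t kIndexBit = std::uint64_t{1} << 63;

    struct Cell {
        T value{};
        std::atomic<bool> ready{false};
    };

public:
    // Producers: a single fetch_add reserves a slot in the active buffer.
    bool push(const T &item) {
        const std::uint64_t s = state_.fetch_add(1, std::memory_order_acq_rel);
        const std::size_t buf = (s & kIndexBit) ? 1 : 0;
        const std::uint64_t slot = s & ~kIndexBit;
        if (slot >= Capacity) return false;  // active buffer full: item dropped
        Cell &cell = buffers_[buf][static_cast<std::size_t>(slot)];
        cell.value = item;
        cell.ready.store(true, std::memory_order_release);  // commit the slot
        return true;
    }

    // Single consumer: swaps the buffers with one exchange, then copies out
    // the retired buffer while producers already fill the other one.
    std::vector<T> drain() {
        consumer_on_second_ = !consumer_on_second_;
        const std::uint64_t next = consumer_on_second_ ? kIndexBit : 0;
        const std::uint64_t old = state_.exchange(next, std::memory_order_acq_rel);
        const std::size_t buf = (old & kIndexBit) ? 1 : 0;
        const std::size_t n = static_cast<std::size_t>(
            std::min<std::uint64_t>(old & ~kIndexBit, Capacity));
        std::vector<T> out;
        out.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            Cell &cell = buffers_[buf][i];
            while (!cell.ready.load(std::memory_order_acquire)) {
                // wait for a producer that reserved this slot to commit it
            }
            out.push_back(cell.value);
            cell.ready.store(false, std::memory_order_relaxed);  // reset for reuse
        }
        return out;
    }

private:
    std::array<std::array<Cell, Capacity>, 2> buffers_;
    std::atomic<std::uint64_t> state_{0};
    bool consumer_on_second_ = false;  // touched only by the consumer
};
```

Producers and the consumer share only `state_`, and the consumer writes it once per batch, which keeps coherence traffic between the kernel-side producers and the user-space reader low.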
Testbed: Mascara & Monsters
Mascara and Monsters are connected by a 10 Gb link.
• Mascara (traffic generation):
– Xeon 6-core X5650 @ 2.57 GHz, 12 GB RAM
– Intel 82599 multi-queue 10G Ethernet adapter
– New socket PF_DIRECT for generation
– By deploying 3-4 cores, it is possible to generate up to ~12 Mpps of 64-byte packets
• Monsters (traffic capture):
– Xeon 6-core X5650 @ 2.57 GHz, 12 GB RAM
– Intel 82599 multi-queue 10G Ethernet adapter
– PFQ on board for traffic capture
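For context on the ~12 Mpps figure, the theoretical packet rate of an Ethernet link follows from the frame size plus the fixed per-frame overhead on the wire (8 bytes of preamble and start-of-frame delimiter, 12 bytes of inter-frame gap). The helper below is only a worked calculation; the function name is made up for the example.

```cpp
// Maximum frames per second on an Ethernet link, given the link speed in
// bit/s and the frame size in bytes (including the 4-byte FCS).
// Each frame also occupies 20 extra bytes on the wire:
// 7 bytes preamble + 1 byte start-of-frame delimiter + 12 bytes inter-frame gap.
constexpr double line_rate_pps(double link_bps, unsigned frame_bytes) {
    return link_bps / ((frame_bytes + 20) * 8.0);
}

// 10 Gbit/s with minimum-size 64-byte frames:
// 10e9 / (84 * 8) = about 14.88 million packets per second.
constexpr double kMaxPps64 = line_rate_pps(10e9, 64);
```

Against that ceiling of roughly 14.88 Mpps, generating ~12 Mpps corresponds to about 80% of the 10 Gbit/s line rate for 64-byte frames.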
Single socket layout
Fully parallel layout
Load balancing across sockets
• Using 12 capturing NAPI contexts
• Varying the number of user space threads
Packet copy
• Copying packets to a variable number of user space threads
• 12 NAPI contexts within the kernel
Future directions
We are working to improve the packet steering framework…
• How can we better distribute packets according to application-specific semantics?
• Enhance balancing groups: allow a single socket to join multiple balancing groups
• Each group is associated with a specific “steering function”
• Investigate the implementation of wait-free stateful algorithms (pimp/CAS)
• Add support for control-plane and data-plane sockets
• Implement a filtering mechanism based on some Bloom filter variant (capture filters)
Conclusions
• Modern commodity architectures are increasingly parallel
• Multithreaded software today is not ready for multi-core architectures:
• Coding and design rules must be strictly followed to achieve linear scalability
• PFQ: a novel Linux packet capture engine
– Better scalability than its competitors
– Flexible packet steering that eases the implementation of multi-threaded user-space applications
– Decouples kernel-space and user-space parallelism
• PFQ webpage and download:
– netgroup.iet.unipi.it/software/pfq

PFQ@ PAM12
