1. PFQ: a Novel Architecture for Packet
Capture on Parallel Commodity
Hardware
Nicola Bonelli, Andrea Di Pietro,
Stefano Giordano, Gregorio Procissi
CNIT e Dip. di Ingegneria dell’Informazione - Università di Pisa
2. Outline
• Introduction and motivation
• Multi-core programming guidelines
• PFQ architecture
• Performance evaluation
• Conclusion and future work
3. Introduction and Motivations
• Designing monitoring applications has become a very challenging task:
– The hardware has evolved: 10 Gbit/s links, multi-core architectures and
multi-queue network devices (MSI-X)…
• The present software for traffic monitoring, including some parts of the
Linux kernel, is not optimized for new hardware
– (+) kernel support for multi-queue network adapters is implemented
– (-) the Linux kernel has very poor support for monitoring applications
– (-) PF_PACKET is extremely slow, even when used in memory-mapped mode (pcap)
– (-) PF_RING has been designed for single-processor systems
• Traffic monitoring should:
– Exploit modern hardware, ideally scaling linearly with the number of cores
– Decouple the hardware parallelism from the software parallelism
– Use a divide-and-conquer approach to steer packets to applications or threads
4. Multi-thread on Multi-core
• What’s wrong with the current software?
– Previous multi-threading paradigms used for single-processor systems are still
valid, but prevent the software from scaling with the number of cores.
• For software to be effective on multi-core systems…
– Semaphores, mutexes, and spinlocks are out of the question!
– R/W mutexes prevent readers from scaling, even though they are supposed to
grant concurrent access to readers
– Atomic operations are sometimes required, but must be used in
moderation
• use sparse counters instead of atomic ones
• design algorithms so that the cost of atomic operations is amortized
– Sharing (writes to shared data) has a serious impact on performance
– Writes to shared memory are delayed by the hardware; reads must be synchronized
– False sharing must, and always can, be avoided
• Wait-free algorithms are mandatory; lock-free algorithms should be
avoided (if possible)…
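The "sparse counter" guideline can be illustrated with a minimal C++11 sketch. It is not taken from the PFQ sources: the names `SparseCounter` and `PaddedCounter` and the 64-byte cache-line size are assumptions made for illustration. Each core writes only to its own cache-line-aligned slot, so no atomic read-modify-write is needed and there is no false sharing; a reader sums the slots on demand.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is typical on x86-64.
constexpr std::size_t kCacheLine = 64;

// One slot per core, aligned (and therefore padded) to a full cache line,
// so two cores never write to the same line (no false sharing).
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

// Hypothetical sparse counter: every writer updates only its own slot.
template <std::size_t NumCores>
class SparseCounter {
public:
    // Must be called only by the owner of slot `core`. With a single writer
    // per slot, a relaxed load/store pair is enough: no locked instruction.
    void add(std::size_t core, std::uint64_t n = 1) {
        assert(core < NumCores);
        std::atomic<std::uint64_t>& v = slots_[core].value;
        v.store(v.load(std::memory_order_relaxed) + n,
                std::memory_order_relaxed);
    }

    // Readers sum all slots. While writers are active the result is an
    // approximate snapshot, which is usually acceptable for statistics.
    std::uint64_t read() const {
        std::uint64_t total = 0;
        for (const PaddedCounter& s : slots_)
            total += s.value.load(std::memory_order_relaxed);
        return total;
    }

private:
    PaddedCounter slots_[NumCores];
};
```

The trade-off is memory (one cache line per core) and a slower read in exchange for a write path that never contends.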
5. PFQ preamble
• PFQ is a novel capture system with native support for 64-bit multi-core
architectures, designed according to all the guidelines presented
above
• PFQ is not a custom driver
• It is an architecture running on top of standard Ethernet drivers, as
well as slightly modified ones (“PFQ-aware drivers”, an idea inherited
from PF_RING-aware drivers)
• PFQ enables packet capture, filtering, aggregation of hardware queues
and devices, packet classification, packet steering and so forth…
• Decouples the hardware parallelism (e.g. Intel RSS) from the
software parallelism
6. PFQ architecture
Built on top of the following components…
• User-space C++11 library that provides the same abstractions as the STL:
containers and iterators
• DB-MPSC queue: double-buffered multiple-producer, single-consumer queue (for
communication with user space):
– Allows NAPI contexts to enqueue packets concurrently
– Reduces sharing and eliminates false sharing between user space and NAPI contexts
– Enables user space to copy packets from the queue to a private buffer in batches
• De-multiplexing Matrix:
– perfect wait-free concurrently accessible data structure
– no serialization is required to steer/copy packets
• SPSC queue:
– enables batching of socket buffers (skb), to increase temporal locality for the memory
manager (SLAB for kernels prior to 2.6.39)
• Driver aware:
– an effective idea inherited from PF_RING
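As an illustration of the SPSC queue component, here is a generic single-producer, single-consumer ring with batched dequeue. This is a sketch and not the PFQ kernel code: the class name `SpscRing`, its methods and the 64-byte alignment are assumptions. With exactly one producer and one consumer, each index has a single writer, so plain atomic loads and stores are enough and no read-modify-write instruction is needed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical SPSC ring buffer; N must be a power of two.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Producer side: returns false when the ring is full.
    bool push(const T& v) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false;
        data_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the item
        return true;
    }

    // Consumer side: copies up to `max` items into `out` in one batch and
    // returns how many were copied.
    std::size_t pop_batch(T* out, std::size_t max) {
        assert(out != nullptr);
        std::size_t h = head_.load(std::memory_order_relaxed);
        std::size_t avail = tail_.load(std::memory_order_acquire) - h;
        std::size_t n = avail < max ? avail : max;
        for (std::size_t i = 0; i < n; ++i)
            out[i] = data_[(h + i) & (N - 1)];
        head_.store(h + n, std::memory_order_release);  // free the slots
        return n;
    }

private:
    T data_[N];
    // Keep the two indices on separate cache lines (64 bytes assumed).
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

Draining in batches means the consumer touches the shared `head_` index once per batch rather than once per item.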
8. Packet steering
Given a packet and a set of sockets, which sockets need to receive it?
• For capture engines that do not support steering, filtering can be used to
dispatch packets across a number of sockets:
– Traversing the socket list to find those interested in the packet has
linear complexity O(n).
– Flexible approach because it enables dispatching as well as copies
• We designed a “packet steering” paradigm that:
– Identifies the destination sockets with O(1) complexity
– Supports both balancing and copying of packets
– Allows custom hash functions for packet dispatching
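The O(1) selection with a custom hash function can be sketched as follows. Nothing here is PFQ's actual API: `FlowKey`, `symmetric_hash` and `steer` are hypothetical names, and the bit-mixing constant is an arbitrary choice. The hash is symmetric, so both directions of a flow reach the same socket, and picking the destination is a single modulo operation regardless of how many sockets are in the group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flow key extracted from the packet headers.
struct FlowKey {
    std::uint32_t src_ip;
    std::uint32_t dst_ip;
    std::uint16_t src_port;
    std::uint16_t dst_port;
};

// Example custom hash. XOR-ing source and destination fields makes it
// symmetric: A->B and B->A produce the same value.
inline std::uint32_t symmetric_hash(const FlowKey& k) {
    std::uint32_t h = (k.src_ip ^ k.dst_ip) ^
                      static_cast<std::uint32_t>(k.src_port ^ k.dst_port);
    h ^= h >> 16;        // simple bit mixing (illustrative constant below)
    h *= 0x45d9f3bu;
    h ^= h >> 16;
    return h;
}

// O(1) steering: the hash indexes directly into the group's socket table;
// no list of sockets is traversed.
inline int steer(const FlowKey& k, const std::vector<int>& group_sockets) {
    assert(!group_sockets.empty());
    return group_sockets[symmetric_hash(k) % group_sockets.size()];
}
```

A different steering policy (for example per-source-IP balancing) only requires swapping the hash function; the lookup itself stays the same.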
9. Packet steering
• Completely concurrent block (wait-free):
– Shared state (de-multiplexing matrix) is mostly read only
– Writes, which are generally rare events, are serialized with respect to each
other to prevent race conditions. Each update of the matrix state is atomic
• Load balancing groups:
– A socket can create or subscribe to a load-balancing group
– It will receive a fraction of the overall traffic
• Socket binding
– One or more hardware queues of a given NIC
– One or more NICs
• Binding and balancing groups are orthogonal and can be used
concurrently
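A mostly-read-only de-multiplexing matrix of this kind could look like the sketch below. It is an assumption-laden illustration, not the PFQ data structure: the `DemuxMatrix` name, the matrix dimensions and the use of a 64-bit socket bitmask per (device, queue) cell are all hypothetical. The data path performs a single atomic load; the rare control-path writers are serialized by a mutex and publish each change with one atomic store.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

constexpr std::size_t kMaxDevices = 4;   // assumed limit
constexpr std::size_t kMaxQueues  = 16;  // assumed limit

// Hypothetical matrix: cell [dev][queue] holds a bitmask of socket ids
// (0..63) that want packets from that hardware queue.
class DemuxMatrix {
public:
    DemuxMatrix() {
        for (auto& row : cells_)
            for (auto& cell : row) cell.store(0, std::memory_order_relaxed);
    }

    // Control path (rare): writers are serialized with each other, and the
    // new mask becomes visible to readers through a single atomic store.
    void bind(unsigned socket_id, std::size_t dev, std::size_t queue) {
        assert(socket_id < 64 && dev < kMaxDevices && queue < kMaxQueues);
        std::lock_guard<std::mutex> lock(writers_);
        std::uint64_t mask = cells_[dev][queue].load(std::memory_order_relaxed);
        cells_[dev][queue].store(mask | (std::uint64_t{1} << socket_id),
                                 std::memory_order_release);
    }

    void unbind(unsigned socket_id, std::size_t dev, std::size_t queue) {
        assert(socket_id < 64 && dev < kMaxDevices && queue < kMaxQueues);
        std::lock_guard<std::mutex> lock(writers_);
        std::uint64_t mask = cells_[dev][queue].load(std::memory_order_relaxed);
        cells_[dev][queue].store(mask & ~(std::uint64_t{1} << socket_id),
                                 std::memory_order_release);
    }

    // Data path: one atomic load, no lock, no retry loop.
    std::uint64_t lookup(std::size_t dev, std::size_t queue) const {
        assert(dev < kMaxDevices && queue < kMaxQueues);
        return cells_[dev][queue].load(std::memory_order_acquire);
    }

private:
    std::array<std::array<std::atomic<std::uint64_t>, kMaxQueues>,
               kMaxDevices> cells_;
    std::mutex writers_;  // taken only on the control path
};
```

Because readers never take the mutex, the packet path stays wait-free even while a socket is being bound or unbound.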
10. Socket queue: DB-MPSC
• The socket queue is an unavoidable contention point:
– Load balancing shuffles packets across sockets
• How can contention be handled without impacting performance?
– Use an atomic operation to reserve a slot within the queue (will be amortized
in future implementations)
– Reduce cache-coherence traffic between the cores running the kernel threads
and the user-space thread
– The swap between buffers is triggered by the user-space thread or by a watermark
– Packets can be copied in batches, or consumed in place
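The slot-reservation and buffer-swap idea can be sketched in user-space C++11 as below. This is a simplified illustration under stated assumptions, not the PFQ implementation: the `DoubleBufferQueue` name, the packing of the active-buffer index and slot counter into one 64-bit word, and the per-slot `ready` flag are all hypothetical choices. Producers reserve a slot with a single atomic add; the consumer swaps buffers with a single atomic exchange and then drains the retired buffer as a batch.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical double-buffered multi-producer, single-consumer queue.
template <typename T, std::size_t Capacity>
class DoubleBufferQueue {
    struct Slot {
        T value{};
        std::atomic<bool> ready{false};
    };
    static constexpr std::uint64_t kCountMask = (std::uint64_t{1} << 63) - 1;

public:
    // Producer: one atomic add both selects the active buffer (top bit)
    // and reserves a slot in it (low bits). Returns false when the active
    // buffer is full, i.e. the item is dropped.
    bool push(const T& v) {
        std::uint64_t ticket = state_.fetch_add(1, std::memory_order_acq_rel);
        std::size_t buf = static_cast<std::size_t>(ticket >> 63);
        std::uint64_t idx = ticket & kCountMask;
        if (idx >= Capacity) return false;
        Slot& s = buffers_[buf][idx];
        s.value = v;
        s.ready.store(true, std::memory_order_release);  // commit the slot
        return true;
    }

    // Consumer (single thread): switch producers to the other buffer, then
    // copy out everything that was enqueued in the retired one.
    std::vector<T> swap_and_drain() {
        std::uint64_t next = (active_ ^ 1u) << 63;  // other buffer, count 0
        std::uint64_t old = state_.exchange(next, std::memory_order_acq_rel);
        std::uint64_t count =
            std::min<std::uint64_t>(old & kCountMask, Capacity);
        std::vector<T> batch;
        batch.reserve(static_cast<std::size_t>(count));
        for (std::uint64_t i = 0; i < count; ++i) {
            Slot& s = buffers_[active_][i];
            // Simplification: wait for a producer that reserved the slot
            // but has not finished writing it yet.
            while (!s.ready.load(std::memory_order_acquire)) {}
            batch.push_back(s.value);
            s.ready.store(false, std::memory_order_relaxed);
        }
        active_ ^= 1u;
        return batch;
    }

private:
    Slot buffers_[2][Capacity];
    std::atomic<std::uint64_t> state_{0};  // [active buffer bit | slot count]
    std::uint64_t active_ = 0;             // touched by the consumer only
};
```

In this sketch the consumer decides when to swap; a watermark-triggered swap would add a check on the reserved index in `push`. Padding each slot to a cache line would further reduce coherence traffic at the cost of memory.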
11. Testbed: Mascara & Monsters
The two machines are connected by a 10 Gb link.
Mascara (traffic generation):
• Xeon 6-core X5650 @2.57 GHz, 12 GBytes RAM
• Intel 82599 multi-queue 10G Ethernet adapter
• New socket PF_DIRECT for generation
• By deploying 3-4 cores, it is possible to generate up to ~12 Mpps of
64-byte packets
Monsters (traffic capture):
• Xeon 6-core X5650 @2.57 GHz, 12 GBytes RAM
• Intel 82599 multi-queue 10G Ethernet adapter
• PFQ on board for traffic capture
14. Load balancing across sockets
• Using 12 capturing NAPI contexts
• Varying the number of user space threads
15. Packet copy
• Copying packets to a variable number of user space threads
• 12 NAPI contexts within the kernel
16. Future directions
We are working to improve the packet steering framework…
• How can we better distribute packets according to application-
specific semantics?
• Enhance balancing groups: allow a single socket to join multiple
balancing groups
• Each group is associated with a “specific steering function”
• Investigate the implementation of wait-free stateful algorithms
(pimp/CAS)
• Add support for control- and data-plane sockets
• Implement a filtering mechanism by means of a Bloom filter
variant (capture filters)
17. Conclusions
• Modern commodity architectures are increasingly parallel
• Today's multi-threaded software is not ready for multi-core
architectures:
• Coding and design rules must be strictly followed to achieve linear
scalability
• PFQ: a novel Linux packet capturing engine
– Better scalability with respect to competitors
– Flexible packet steering that eases the implementation of
multi-threaded user-space applications
– Decouples kernel space and user space parallelism
• PFQ webpage and download:
– netgroup.iet.unipi.it/software/pfq