The document summarizes lessons learned from building a real-time network traffic analyzer in C/C++. Key points include:
- Libpcap was used for traffic capturing as it is cross-platform, supports PF_RING, and has a relatively easy API.
- SQLite was used for data storage due to its small footprint, fast performance, embeddability, SQL support, and B-tree indexing.
- A producer-consumer model with a blocking queue was implemented to handle packet processing in multiple threads.
- Memory pooling helped address performance issues caused by excessive malloc calls during packet aggregation.
- Custom spin locks based on atomic operations improved performance over mutexes on FreeBSD/OSX.
Real-time traffic analyzer
1. Lessons we learned while building a real-time network traffic analyzer in C/C++
Alex Moskvin
CEO/CTO @ Plexteq
2. About myself
• CEO/CTO Plexteq OÜ
• Ph.D. in information technology
• Interests
• Software architecture
• High-load systems
• Everything under the hood
• AI/ML + BigData
• Knowledge sharing ;)
• Follow me
• https://twitter.com/amoskvin
• https://www.facebook.com/moskvin.aleksey
3. Plexteq
• High-load backends
• Complex distributed data processing
pipelines
• Big Data / BI
• We have our custom products
(hardware + software solutions)
We are hiring! ;)
4. Agenda
1. What the whole thing was about
2. How we decided to solve it
3. Challenges we faced
4. Lessons we learned
6. Task definition
• Network services provider needs:
• Analyze past threats/interactions
• Real-time indication of network spikes
• Aggregate metadata from hundreds of systems
• Solution should be:
• Fast and resource efficient (no CPU/RAM hogging)
• Potentially cross-platform
• Easy to integrate with ETL and BI systems
• Regular bandwidth: 100-1000 Mbps
7. Data model
2 dimensions:
• Per port
• Time period, source IP, destination port, protocol type
• In/out bytes, in/out packets
• Per protocol type (TCP/UDP/… traffic)
• Time period, protocol type
• In/out bytes, in/out packets
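The two aggregation records above can be sketched as C structs; the field names here are assumptions for illustration, not the original code:

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical sketch of the per-port aggregation record:
 * one row per (time period, source IP, destination port, protocol). */
struct port_stats {
    time_t   period;     /* start of the aggregation time period */
    uint32_t src_ip;     /* source IPv4 address */
    uint16_t dst_port;   /* destination port */
    uint8_t  proto;      /* protocol type, e.g. IPPROTO_TCP / IPPROTO_UDP */
    uint64_t in_bytes, out_bytes;
    uint64_t in_packets, out_packets;
};

/* The per-protocol record drops the IP/port dimensions. */
struct proto_stats {
    time_t   period;
    uint8_t  proto;
    uint64_t in_bytes, out_bytes;
    uint64_t in_packets, out_packets;
};
```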
12. Existing solutions
$ tcpdump -i eth0
$ tcpdump tcp port 443
$ tcpdump tcp 'port 443 or port 80'
$ tcpdump tcp 'port 443 or port 80' -w out-file
13. Existing solutions
• Drawbacks
• tcpdump / wireshark
• Single threaded
• Large disk space overhead (without extra tweaking it writes full packet contents)
• Cannot write a custom data format (extra effort is needed to parse the .pcap file)
• iptables
• Could work, but would be hard to customize for further feature requests
• Not cross-platform
18. Traffic capturing :: Raw sockets
Drawbacks:
• Kernel-to-userspace copies
• Developer needs to be proficient with packet structure and low-level networking semantics, e.g. endianness
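To make the drawback concrete, here is a minimal sketch of the raw-socket approach (Linux `AF_PACKET`, requires root): every frame is copied from kernel to user space, and multi-byte header fields arrive in network byte order, so parsing is fully manual. The `ether_type` helper is an illustrative name, not code from the talk:

```c
#include <arpa/inet.h>   /* htons, ntohs */
#include <stdint.h>
#include <string.h>      /* memcpy */
#ifdef __linux__
#include <sys/socket.h>
#include <linux/if_ether.h>  /* ETH_P_ALL */
#endif

/* Extract the EtherType from a raw Ethernet frame -- the kind of manual,
 * endianness-aware parsing the slide calls out as a drawback. */
uint16_t ether_type(const uint8_t *frame) {
    uint16_t net;
    memcpy(&net, frame + 12, sizeof net); /* bytes 12..13 hold EtherType */
    return ntohs(net);                    /* network -> host byte order */
}

#ifdef __linux__
/* Open a raw capture socket; every received frame is copied from the
 * kernel to user space. Needs root, hence only sketched here. */
int open_raw_capture(void) {
    return socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
}
#endif
```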
20. Traffic capturing :: pf_ring
PF_RING – kernel bypass
Motivation:
• Kernel is the slow part
• Vanilla kernel can handle 1-2 Mpps
• PF_RING can do 15+ Mpps on commodity hardware
Pros
• Huge workloads
• Could be used for network server application development
• Zero copy technique
Cons
• Complicated API
• Support at the network card driver level is preferred
• PF_RING ZC API is complex
• Not cross-platform
21. Traffic capturing :: 3rd party libs
Pros:
• Cross-platform
• May utilize low-level, OS-dependent optimizations and extensions, e.g. PF_RING
22. Traffic capturing :: winner
libpcap
• Cross-platform
• Supports PF_RING
• The fastest implementation
• Well maintained
• Relatively easy API
25. Solutions to store data
We wanted something that:
• Has a small footprint and is fast
• Is preferably a one-file database
• Is embeddable
• Supports SQL
• Supports B-tree indices
26. Solutions to store data
Winner: SQLite
• Small footprint, fast
• One-file database
• Embeddable
• Supports SQL
• Supports B-tree indices
27. Solutions to store data
Drawbacks:
• Single threaded – we need to synchronize/serialize write ops to it in our application
32. Producer-consumer problem
• Issues:
• Aggregator cannot keep up with traffic above 25 Mbps
• There is a significant, growing delay between incoming traffic and flushed stats
This is actually a producer-consumer type of problem
35. Producer-consumer problem
• Solution:
• Producer runs in separate thread
• Multiple consumers that run in separate threads
Possible implementations:
• Message broker
• Blocking queue
37. Producer-consumer problem
A very good implementation is in APR (Apache Portable Runtime),
used by the Apache HTTP Server
http://apr.apache.org/docs/apr-util/1.3/apr__queue_8h.html
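For illustration, here is a minimal bounded blocking queue in the spirit of APR's `apr_queue_t` (a sketch, not the talk's production code): `push` blocks when the queue is full, `pop` blocks when it is empty, so producers and consumers pace each other automatically:

```c
#include <pthread.h>

#define QCAP 256  /* fixed capacity for the sketch */

typedef struct {
    void *items[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} bqueue_t;

void bq_init(bqueue_t *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

void bq_push(bqueue_t *q, void *item) {   /* producer side */
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)              /* block until a slot frees up */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

void *bq_pop(bqueue_t *q) {               /* consumer side */
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                 /* block until an item arrives */
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}
```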
39. Packet processing flow
• Issues:
• Application is capable of handling only about 82 Mbps of traffic flow
• CPU usage is 100+%, all consumed by our app (eaten by malloc calls)
40. Memory allocation
• Business logic needed at least one malloc every time packet stats got aggregated into the in-memory data structure
42. Malloc issue
Solution:
• Use memory pooling
• Pre-allocate a block with a single malloc
• Perform allocations within the block (an allocation within a block is just pointer arithmetic)
43. Malloc issue
Drawbacks:
• Can't free an individual allocation within a block
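A sketch of the pooling scheme above (names are illustrative): one big malloc up front, then each "allocation" is a bump of an offset inside the block. Matching the stated drawback, there is no per-allocation free, only resetting or destroying the whole pool:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *block;  /* pre-allocated with one malloc */
    size_t   size;   /* total block size */
    size_t   used;   /* bump-pointer offset */
} mempool_t;

int pool_init(mempool_t *p, size_t size) {
    p->block = malloc(size);
    p->size = size;
    p->used = 0;
    return p->block != NULL ? 0 : -1;
}

void *pool_alloc(mempool_t *p, size_t n) {
    n = (n + 7u) & ~(size_t)7u;        /* keep 8-byte alignment */
    if (p->used + n > p->size)
        return NULL;                   /* pool exhausted */
    void *ptr = p->block + p->used;    /* allocation = pointer arithmetic */
    p->used += n;
    return ptr;
}

void pool_reset(mempool_t *p)   { p->used = 0; }    /* drop everything at once */
void pool_destroy(mempool_t *p) { free(p->block); } /* one free for the block */
```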
46. Mutexes
• Results:
• Linux:
• Application handles ~1 Gbps of traffic flow
• CPU usage is 10-15% on a 4-core Xeon 2.8 GHz
• FreeBSD/OSX:
• Application handles only ~615 Mbps of traffic flow
• CPU usage is 35% on a 4-core Xeon 2.8 GHz
47. Mutexes
• Possible reasons for the FreeBSD/OSX gap:
• Profiler shows a high number of thread synchronization calls from our app (pthread_mutex_lock, pthread_mutex_unlock)
48. Mutexes
• Investigation:
• pthread_mutex_* on Linux is implemented using futexes (fast user-space mutex): no kernel locking, no context switching on the uncontended path
• POSIX is a standard; it doesn't require a specific implementation
• OSX/FreeBSD use a heavier approach involving kernel-level locking and context switches
50. Mutexes
• Thread synchronization approaches:
• Lock based
• Semaphore
• Mutex
• Lock free
• Futex (could lock in an edge case)
• Spin lock
• CAS based spin lock
51. Mutexes
• Our target critical section:
• No IO operations
• Just pointer operations, arithmetic, and allocations in the memory pool
52. Mutexes
• Options:
• Spin lock from the OS
• pthread_spin_lock
• Custom spin lock based on CAS operations
• GCC atomic built-ins
• __sync_lock_test_and_set
• __sync_lock_release
54. Mutexes
1) volatile suggests that "lock" may be changed by other threads
2) __sync_lock_test_and_set and __sync_lock_release are atomic built-ins which guarantee atomic memory access
3) __sync_lock_test_and_set atomically sets the lock to 1 and returns the previous value (0 means we acquired the lock)
4) If lock == 1, we keep looping until another thread calls __sync_lock_release
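A reconstruction of the spin lock the numbered notes describe, using the GCC atomic built-ins; the variable and function names are assumptions:

```c
/* 1) volatile: "lock" may be changed by other threads */
volatile int lock = 0;

void spin_lock(void) {
    /* 3) atomically set lock to 1; the built-in returns the previous
     *    value, so we get 0 exactly when we acquired the lock.
     * 4) while it returns 1, busy-wait until the owner releases it. */
    while (__sync_lock_test_and_set(&lock, 1))
        ; /* spin */
}

void spin_unlock(void) {
    /* 2) atomic release: stores 0 back with release semantics */
    __sync_lock_release(&lock);
}
```

This fits the critical section described on slide 51: with no IO and only pointer arithmetic inside, the lock is held so briefly that spinning is cheaper than putting threads to sleep via the kernel.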
55. Mutexes
• Results:
• Linux:
• Application handles ~1 Gbps of traffic flow
• CPU usage is 10-15% on a 4-core Xeon 2.8 GHz
• FreeBSD/OSX:
• Application handles ~1 Gbps of traffic flow
• CPU usage is 8-12% on a 4-core Xeon 2.8 GHz
With PF_RING, packets are delivered only to the PF_RING client, not to the kernel network stack; since the kernel is the slow part, this ensures the fastest operation.
If the database doesn't exist, it gets created.
How? Libpcap is single threaded.
How to join producers with consumers?
What is a blocking queue?
Widely used in server applications – 1 request = 1 pool.