The document summarizes lessons learned from building a real-time network traffic analyzer in C/C++. Key points include:
- Libpcap was used for traffic capturing as it is cross-platform, supports PF_RING, and has a relatively easy API.
- SQLite was used for data storage due to its small footprint, fast performance, embeddability, SQL support, and B-tree indexing.
- A producer-consumer model with a blocking queue was implemented to handle packet processing in multiple threads.
- Memory pooling helped address performance issues caused by excessive malloc calls during packet aggregation.
- Custom spin locks based on atomic operations improved performance over mutexes on FreeBSD/OSX.
Real-time traffic analyzer
1. Lessons we learned while building a real-time network traffic analyzer in C/C++
Alex Moskvin
CEO/CTO @ Plexteq
2. About myself
• CEO/CTO Plexteq OÜ
• Ph.D. in information technology
• Interests
• Software architecture
• High-load systems
• Everything under the hood
• AI/ML + BigData
• Knowledge sharing ;)
• Follow me
• https://twitter.com/amoskvin
• https://www.facebook.com/moskvin.aleksey
3. Plexteq
• High-load backends
• Complex distributed data processing
pipelines
• Big Data / BI
• We have our custom products
(hardware + software solutions)
We are hiring! ;)
4. Agenda
1. What the whole thing was about
2. How we decided to solve it
3. Challenges we faced
4. Lessons we learned
6. Task definition
• Network services provider needs:
• Analyze past threats/interactions
• Real-time indication of network spikes
• Aggregate metadata from hundreds of systems
• Solution should be:
• Fast and resource efficient (no CPU/RAM hogging)
• Potentially cross-platform
• Easy to integrate with ETL and BI systems
• Regular bandwidth: 100-1000 Mbps
7. Data model
2 dimensions:
• Per port
• Time period, source IP, destination port, protocol type
• In/out bytes, in/out packets
• Per protocol type (TCP/UDP/… traffic)
• Time period, protocol type
• In/out bytes, in/out packets
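The two aggregation records above can be sketched as C structs; the field names here are assumptions for illustration, not the original code:

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical sketch of the per-port aggregation record:
 * one row per (time period, source IP, destination port, protocol). */
struct port_stats {
    time_t   period;     /* start of the aggregation time period */
    uint32_t src_ip;     /* source IPv4 address */
    uint16_t dst_port;   /* destination port */
    uint8_t  proto;      /* protocol type, e.g. IPPROTO_TCP / IPPROTO_UDP */
    uint64_t in_bytes, out_bytes;
    uint64_t in_packets, out_packets;
};

/* The per-protocol record drops the IP/port dimensions. */
struct proto_stats {
    time_t   period;
    uint8_t  proto;
    uint64_t in_bytes, out_bytes;
    uint64_t in_packets, out_packets;
};
```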
12. Existing solutions
$ tcpdump -i eth0
$ tcpdump tcp port 443
$ tcpdump tcp 'port 443 or port 80'
$ tcpdump tcp 'port 443 or port 80' -w out-file
13. Existing solutions
• Drawbacks
• tcpdump / wireshark
• Single threaded
• Large disk space overhead (without extra tweaking it writes full packet contents)
• Cannot write a custom data format (extra effort is needed to parse the .pcap file)
• iptables
• Could work, but would be hard to customize for further feature requests
• Not cross-platform
18. Traffic capturing :: Raw sockets
Drawbacks:
• Kernel-to-userspace copies
• Developer needs to be proficient with packet structure and low-level networking semantics, e.g. endianness
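To make the drawback concrete, here is a minimal sketch of the raw-socket approach (Linux `AF_PACKET`, requires root): every frame is copied from kernel to user space, and multi-byte header fields arrive in network byte order, so parsing is fully manual. The `ether_type` helper is an illustrative name, not code from the talk:

```c
#include <arpa/inet.h>   /* htons, ntohs */
#include <stdint.h>
#include <string.h>      /* memcpy */
#ifdef __linux__
#include <sys/socket.h>
#include <linux/if_ether.h>  /* ETH_P_ALL */
#endif

/* Extract the EtherType from a raw Ethernet frame -- the kind of manual,
 * endianness-aware parsing the slide calls out as a drawback. */
uint16_t ether_type(const uint8_t *frame) {
    uint16_t net;
    memcpy(&net, frame + 12, sizeof net); /* bytes 12..13 hold EtherType */
    return ntohs(net);                    /* network -> host byte order */
}

#ifdef __linux__
/* Open a raw capture socket; every received frame is copied from the
 * kernel to user space. Needs root, hence only sketched here. */
int open_raw_capture(void) {
    return socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
}
#endif
```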
20. Traffic capturing :: pf_ring
PF_RING – kernel bypass
Motivation:
• Kernel is the slow part
• Vanilla kernel can handle 1-2 Mpps
• PF_RING can do 15+ Mpps on commodity hardware
Pros
• Huge workloads
• Could be used for network server application development
• Zero copy technique
Cons
• Complicated API
• Support at the network card driver level is preferred
• PF_RING ZC API is complex
• Not cross-platform
21. Traffic capturing :: 3rd party libs
Pros:
• Cross-platform
• May utilize low-level, OS-dependent optimizations and extensions, e.g. PF_RING
22. Traffic capturing :: winner
libpcap
• Cross-platform
• Supports PF_RING
• The fastest implementation
• Well maintained
• Relatively easy API
25. Solutions to store data
We wanted something that:
• Has a small footprint and is fast
• Is preferably a one-file database
• Is embeddable
• Supports SQL
• Supports B-tree indices
26. Solutions to store data
Winner: SQLite
• Small footprint, fast
• One-file database
• Embeddable
• Supports SQL
• Supports B-tree indices
27. Solutions to store data
Drawbacks:
• Single threaded – we need to synchronize/serialize write ops to it in our application
32. Producer-consumer problem
• Issues:
• Aggregator cannot keep up with traffic above 25 Mbps
• There is a significant, growing delay between incoming traffic and flushed stats
This is actually a producer-consumer type of problem
35. Producer-consumer problem
• Solution:
• Producer runs in separate thread
• Multiple consumers that run in separate threads
Possible implementations:
• Message broker
• Blocking queue
37. Producer-consumer problem
A very good implementation is in APR (Apache Portable Runtime),
used by the Apache HTTP Server
http://apr.apache.org/docs/apr-util/1.3/apr__queue_8h.html
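For illustration, here is a minimal bounded blocking queue in the spirit of APR's `apr_queue_t` (a sketch, not the talk's production code): `push` blocks when the queue is full, `pop` blocks when it is empty, so producers and consumers pace each other automatically:

```c
#include <pthread.h>

#define QCAP 256  /* fixed capacity for the sketch */

typedef struct {
    void *items[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} bqueue_t;

void bq_init(bqueue_t *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

void bq_push(bqueue_t *q, void *item) {   /* producer side */
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)              /* block until a slot frees up */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

void *bq_pop(bqueue_t *q) {               /* consumer side */
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                 /* block until an item arrives */
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}
```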
39. Packet processing flow
• Issues:
• Application is capable of handling only about 82 Mbps of traffic flow
• CPU usage is 100+%, all consumed by our app (eaten by malloc calls)
40. Memory allocation
• Business logic needed at least one malloc every time packet stats got aggregated into the in-memory data structure
42. Malloc issue
Solution:
• Use memory pooling
• Pre-allocate a block with a single malloc
• Perform allocations within the block (an allocation within a block is just pointer arithmetic)
43. Malloc issue
Drawbacks:
• Can't free an individual allocation within a block
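A sketch of the pooling scheme above (names are illustrative): one big malloc up front, then each "allocation" is a bump of an offset inside the block. Matching the stated drawback, there is no per-allocation free, only resetting or destroying the whole pool:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *block;  /* pre-allocated with one malloc */
    size_t   size;   /* total block size */
    size_t   used;   /* bump-pointer offset */
} mempool_t;

int pool_init(mempool_t *p, size_t size) {
    p->block = malloc(size);
    p->size = size;
    p->used = 0;
    return p->block != NULL ? 0 : -1;
}

void *pool_alloc(mempool_t *p, size_t n) {
    n = (n + 7u) & ~(size_t)7u;        /* keep 8-byte alignment */
    if (p->used + n > p->size)
        return NULL;                   /* pool exhausted */
    void *ptr = p->block + p->used;    /* allocation = pointer arithmetic */
    p->used += n;
    return ptr;
}

void pool_reset(mempool_t *p)   { p->used = 0; }    /* drop everything at once */
void pool_destroy(mempool_t *p) { free(p->block); } /* one free for the block */
```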
46. Mutexes
• Results:
• Linux:
• Application handles ~1 Gbps of traffic flow
• CPU usage is 10-15% on a 4-core Xeon 2.8 GHz
• FreeBSD/OSX:
• Application handles only ~615 Mbps of traffic flow
• CPU usage is 35% on a 4-core Xeon 2.8 GHz
47. Mutexes
• Possible reasons for the FreeBSD/OSX gap:
• Profiler shows a high number of thread synchronization calls from our app (pthread_mutex_lock, pthread_mutex_unlock)
48. Mutexes
• Investigation:
• pthread_mutex_* on Linux is implemented using futexes (fast user-space mutex): no kernel locking, no context switching on the uncontended path
• POSIX is a standard; it doesn't require a specific implementation
• OSX/FreeBSD use a heavier approach involving kernel-level locking and context switches
50. Mutexes
• Thread synchronization approaches:
• Lock based
• Semaphore
• Mutex
• Lock free
• Futex (could lock in an edge case)
• Spin lock
• CAS based spin lock
51. Mutexes
• Our target critical section:
• No IO operations
• Just pointer operations, arithmetic, and allocations in the memory pool
52. Mutexes
• Options:
• Spin lock from the OS
• pthread_spin_lock
• Custom spin lock based on CAS operations
• GCC atomic built-ins
• __sync_lock_test_and_set
• __sync_lock_release
54. Mutexes
1) volatile suggests that "lock" may be changed by other threads
2) __sync_lock_test_and_set and __sync_lock_release are atomic built-ins which guarantee atomic memory access
3) __sync_lock_test_and_set atomically sets the lock to 1 and returns the previous value (0 means we acquired the lock)
4) If lock == 1, we keep looping until another thread calls __sync_lock_release
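A reconstruction of the spin lock the numbered notes describe, using the GCC atomic built-ins; the variable and function names are assumptions:

```c
/* 1) volatile: "lock" may be changed by other threads */
volatile int lock = 0;

void spin_lock(void) {
    /* 3) atomically set lock to 1; the built-in returns the previous
     *    value, so we get 0 exactly when we acquired the lock.
     * 4) while it returns 1, busy-wait until the owner releases it. */
    while (__sync_lock_test_and_set(&lock, 1))
        ; /* spin */
}

void spin_unlock(void) {
    /* 2) atomic release: stores 0 back with release semantics */
    __sync_lock_release(&lock);
}
```

This fits the critical section described on slide 51: with no IO and only pointer arithmetic inside, the lock is held so briefly that spinning is cheaper than putting threads to sleep via the kernel.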
55. Mutexes
• Results:
• Linux:
• Application handles ~1 Gbps of traffic flow
• CPU usage is 10-15% on a 4-core Xeon 2.8 GHz
• FreeBSD/OSX:
• Application handles ~1 Gbps of traffic flow
• CPU usage is 8-12% on a 4-core Xeon 2.8 GHz
With PF_RING, packets are delivered only to the PF_RING client, not to the kernel network stack; since the kernel is the slow part, this ensures the fastest operation.
If the database doesn't exist, it gets created.
How? Libpcap is single threaded.
How to join producers with consumers?
What is a blocking queue?
Widely used in server applications – 1 request = 1 pool.