An embedded system usually involves low-level languages like C and highly customized hardware. This talk presents a use case of a soft real-time system developed with a very different approach: it was written in Go. We will look at the advantages of this choice, along with its limits.
4. Quality Classifier
● Industrial machines
● Quality features
○ Color
○ Weight
○ Defects
○ Shape
○ ...
● Classification
○ Grouping items together, according to quality features
Photo by Kate Buckley on flickr
6. Specs & Outline
● 100 lanes
● 20 items/sec per lane
● 2000 items/sec
● 10 exits per lane
● Industrial scale
(Diagram: feeder feeding a lane with a rotary encoder; sensor, classify step, ejector, and exits along the lane)
Photo by Chris Chadd from Pexels
7. Need for Precision
● Items are eventually ejected
○ Precise timing of ejection
○ Precision of 250 us
○ Multiple exits
● Usually, real-time OSes are used
○ Higher determinism
9. Our Machine Layout
(Diagram: sensors feed data into the BL board; data flows out to the IO boards driving the ejectors of exit 1 and exit 2; a checkpoint sits between the sensors and the exits)
● Boards
○ BL: Business Logic
○ IO: Input/Output
● Business Logic
○ Acquires data from sensors
○ Manages every lane
● Network traffic is heavy
○ Up to 250 sensors
○ Up to 2000 items per second
● Checkpoint
○ Trigger for classification
10. The Challenge
Canonical Way
● RTOS kernel
● Custom hardware & boards
● CANBUS communication
● Single board
● Firmware: bare-metal C
11. The Challenge: Linux and Go
Our Solution
● GNU/Linux standard kernel
● Standard hardware components
● Ethernet-based communication
● Distributed system
● Go language
12. Why Linux?
GNU/Linux
● Real-time processes
● Microprocessor boards
● No safety certifications
● Plenty of drivers
● Separation of responsibilities
● Debug on desktop with standard tools
● Many languages & libraries
RTOS
● Tasks with priorities
● Microcontrollers (no MMU)
● Safety and certifications
● Limited number of drivers
● Single big application
● Debug on hardware boards
● Few languages and libraries
13. Network Connections
● “BL” Single Business Logic board
○ Freescale i.MX 6, Quad Core ARM Cortex-A9 @ 1.2 GHz
○ Performs the item classification for every lane
● “IO” Multiple Input/Output boards
○ Develboard Atmel, Single Core ARM @ 600 MHz
○ Digital inputs and outputs
● Multiple sensors
● Ethernet bus with standard switches/routers
(Diagram: star topology — the BL board connected to the IO boards through an Ethernet switch)
14. Different Topology with a Linux Software Bridge
(Diagram: star topology vs. serial topology — in the serial topology each IO board bridges traffic to the next)
● Simplified cabling
● Software bridge: 15% CPU
15. Latency for Soft Real Time
● Kernel driver with a precision of 250 us
○ DMA + double buffering
○ Buffer has a duration of 100 ms
○ Actual precision of 66 us
● Queue of scheduled activations
○ User-space software writes activations to the kernel driver
● Soft real-time latency
○ 100 ms + queue management ~= 150 ms
○ System can’t react faster than 150 ms (e.g. to a change of speed)
16. Rotary Encoder
● Lanes are physically bound
○ Multiple encoders in case of big machines
● Encoder steps
○ Square waves
○ 2000 steps/round
● Kernel driver
○ Parameters exported in sysfs
(Diagram: quadrature encoder signals A, B and index signal Z)
17. Linear Interpolation
● We cannot synchronize thousands of times per second over Ethernet
● Synchronization every 100 ms
● Linear interpolation
○ Encoder accelerations are “slow” because the encoder is bound to a mechanical transport
● Workaround for the lack of a real-time protocol
(Graph: step count vs. time — real curve vs. interpolated curve)
18. BL/IO Clock Synchronization
● Activation messages are marked with a specific timestamp
○ We need to synchronize clocks
● Usually, NTP is used
○ Precision of ~milliseconds => not enough for us
○ We need a precision of (at least) 250 us
● Precision Time Protocol, IEEE 1588 (PTP)
● Two PTP timestamping models: hardware or software
○ Software timestamping: kernel interrupt => precision of ~microseconds
○ Hardware timestamping: Ethernet interface => precision of ~nanoseconds
○ The Develboard supports IEEE 1588 hardware timestamping, but software timestamping is enough for us
20. Basic Advantages
● Simple language; clients got used to it very quickly
● Simple documentation and maintenance
● Static binaries
● Large ecosystem of libraries
● Concurrent programming
● Easy cross-compilation
○ Embedded (ARM)
○ Windows
○ Linux
21. Embedded: Go vs C++
● Stack traces and data race analysis built in
○ In C++, Valgrind slows down performance
● Debug tools
○ Remote debugging and system analysis (gdb vs pprof)
● Linters and code analysis
○ Easier to integrate static analysis tools (e.g. golint, go vet)
● Build tags (go build -tags …)
○ Useful for embedded apps and stubs
○ Cleaner approach compared to #ifdef
22. Fine tuning: Disassembly
● go tool objdump -s main.join -S <binary_name>
func join(strings []string) string {
0x8bae0 e59a1008 MOVW 0x8(R10), R1
0x8bae4 e15d0001 CMP R1, R13
0x8bae8 9a00001a B.LS 0x8bb58
0x8baec e52de024 MOVW.W R14, -0x24(R13)
0x8baf0 e3a00000 MOVW $0, R0
0x8baf4 e3a01000 MOVW $0, R1
0x8baf8 e3a02000 MOVW $0, R2
for _, str := range strings {
0x8bafc ea00000f B 0x8bb40
0x8bb00 e58d0020 MOVW R0, 0x20(R13)
0x8bb04 e59d3028 MOVW 0x28(R13), R3
0x8bb08 e7934180 MOVW (R3)(R0<<3), R4
0x8bb0c e0835180 ADD R0<<$3, R3, R5
0x8bb10 e5955004 MOVW 0x4(R5), R5
package main
import (
"fmt"
"os"
)
func join(strings []string) string {
var ret string
for _, str := range strings {
ret += str
}
return ret
}
func main() {
fmt.Println(join(os.Args[1:]))
}
23. How do we perform tests on Embedded?
1. Unit tests
2. Full integration tests
○ Integration framework
○ Mocking boards/instruments as goroutines
○ Easier than in C++
○ Fast prototyping for tests
3. Continuous integration
○ The real embedded system was simulated on CircleCI
24. Avoid Performance Regression
● Monitoring of performance
○ Metrics
○ Profiling
● Google pprof upstream version:
○ go get -u github.com/google/pprof
● Small CPU profile file => 10 minutes execution => just 185 KiB
○ Stand-alone, no binary needed
○ Can read from a local file or over HTTP
○ pprof -http :8081 http://localhost:8080/debug/pprof/profile?seconds=30
25. Hardware in the Loop
● Automatic performance monitoring
● We have a real hardware test bench
● We want to deploy our system directly to the test bench
● Results from the test bench are retrieved by CircleCI
(Diagram: Repo → CI → Hardware → Metrics)
26. Remote Introspection via Browser
● Uncommon in embedded apps
● Expvar
○ Standard interface for public variables
○ Exports figures about the program
○ JSON format
import (
	"expvar"
	"net/http"
)

// At global scope
var requestCount = expvar.NewInt("RequestCount")
...
func myHandler(w http.ResponseWriter, r *http.Request) {
	requestCount.Add(1)
	...
}
28. Metrics
● Performance analysis
○ We don’t want performance regressions
○ Refactoring
○ Test suites don’t help here
● “Tachymeter” library to monitor metrics
○ Low impact: samples are added to a circular buffer
○ Average, standard deviation, percentiles, min, max, …
● Multiple outputs
○ Formatted string, JSON string, histogram (text and HTML)
○ HTTP endpoint for remote analysis
29. Checkpoint Margin
● Average
○ Avg 2.301660948s
○ StdDev 176.75148ms
● Percentiles
○ P75 2.222552667s
○ P95 1.921699001s
○ P99 1.721095s
○ P999 1.575430001s
● Limits
○ Max 2.916016667s
○ Min 1.464427001s
(Diagram: margin between the sensors' checkpoint trigger and the scheduled activation; 2 minutes run)
30. How is Checkpoint Margin affected?
● I/O bound
● Reading packets from connections
● We need to read fairly from 250 TCP sockets
(Diagram: sensors connected to the BL board over Ethernet/TCP)
31. Standard Network Loop
● One goroutine per connection
○ 1. Read data from network
○ 2. Decode packets
○ 3. Send to main loop via channel
● chan packet
○ Sending one packet at a time to the main loop
● Can we do better?
(Diagram: concurrent per-connection TCP goroutines feeding the main loop over chan packet)
32. Batched Channel
● chan packet vs chan []packet
○ Sending one packet at a time is too slow
● Use a single channel write operation to send all packets received from a single TCP read
○ Minimizing channel writes is good
(Diagram: concurrent per-connection TCP goroutines feeding the main loop over chan []packet)
33. Number of Channel Writes
● Channel
○ Buffered
○ Slice of packets
● Writes per second
○ 2000 → 25000 w/s
● Total GC STW time
○ 2.28 → 11.50 s
(Graph: checkpoint margin [s] and GC time [s] vs. channel writes per second [w/s]; 2 minutes run)
34. Failed Test: Using a Mutex
● Goroutines will block on a mutex
○ High contention
○ Go scheduler is cooperative
● Deadline missed
○ Checkpoint event is delayed
● Conn.Read() latency (Channel vs Mutex)
○ Min: 13 us vs 13 us
○ Max: 773 us vs 1.15 s
○ P99: 64 us vs 510 ms
● Activation margin (Channel vs Mutex)
○ P99: 466 ms vs -1.13 s
(Diagram: per-connection TCP goroutines contending on a mutex; the checkpoint margin turns into an activation delay)
35. Alternative: Using EPOLL
● The EPOLL syscall allows using a single goroutine
● MultiReader Go interface
○ Reads from multiple connections
○ Monitors multiple file descriptors
● Drawbacks
○ It can’t be used on Windows
○ Cannot use net.Socket
○ Maintenance
(Diagram: a single MultiRead goroutine feeding the main loop over chan []packet)
type MultiPacketReader interface {
	// TCP connection with framing
	Register(conn PacketConn)
	// Reads from one of the
	// registered connections
	ReadPackets(packets [][]byte) (n int, conn PacketConn, err error)
}
36. CPU Usage: EPOLL vs Go
● 4 CPUs in total
○ Graph shows just one CPU (for simplicity)
● Go implementation
○ CPU usage is higher...
○ ...but more “uniform”
● EPOLL implementation
○ CPU cores are switched more frequently
(Graph: CPU usage of the EPOLL and Go implementations over a 2 minute run)
37. Conclusions
● Standard Linux OS and hardware
○ Faster development
○ Distributed system
● Testing and monitoring
○ Fast prototyping for tests
○ Profiling and metrics
○ Performance tests on real hardware
● Optimizations
○ Goroutine management
○ Packet reception
● Drawbacks
○ GC impact must be reduced
○ Mutex contention can be a problem
○ Network APIs are not flexible enough
● Go can be used for embedded apps!
Thanks
mirko@develer.com