This document summarizes key design choices that let the Go programming language achieve high scalability. It discusses how Go uses lightweight goroutines instead of operating system threads to minimize context-switching overhead: goroutines switch context only in well-defined situations and have very small, dynamically sized stacks for efficient memory usage. It also covers how Go schedules goroutines cooperatively in user space, avoiding kernel transitions, and how the runtime's integrated poller combines the simplicity of threads with the efficiency of an event loop.
1. Design Choices of Golang for High Scalability
SeongJae Park <sj38.park@gmail.com>
2. This work by SeongJae Park is licensed under the Creative
Commons Attribution-ShareAlike 3.0 Unported License. To
view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
3. These slides were presented during
GDG Seoul Meetup 201709
(https://www.meetup.com/GDG-Seoul/events/242054608/)
4. Nice To Meet You
SeongJae Park
sj38.park@gmail.com
Part-time Linux kernel programmer at KOSSLAB
5. What Makes Golang So Special on Multicore?
● People say Go is a good choice for high performance and scalability
● Why is scalability so important?
● Why are existing solutions insufficient?
● What makes Go so special for these problems?
● TL;DR: Goroutines, dynamic stack management, and the integrated poller
DISCLAIMER: This talk is based on Dave Cheney's OSCON15 presentation
(http://cdn.oreillystatic.com/en/assets/1/event/129/High%20performance%20servers%20without%20the%20event%20loop%20Presentation.pdf)
8. Moore's Law
● Law: The number of transistors per square inch doubles roughly every 18 months
● CPU vendors used the law to increase CPU clock speed; the only thing programmers needed for better performance was patience, and the speedup came as a free lunch
● However, CPU clock speed stopped increasing more than a decade ago
https://www.karlrupp.net/wp-content/uploads/2015/06/35years.png
(Chart: 35 years of microprocessor trend data; number of transistors, single-thread performance, clock speed, power in watts, and number of cores)
11. Why No Clock Speed?
● Electrons move between transistors on every clock tick
(clock speed is analogous to the switch on/off speed in the circuit diagram below)
● Moving anything requires energy; here we use electrical energy
● Some of that electrical energy leaks during the transformation to kinetic energy and becomes heat; the temperature rises
● High temperature damages the CPU
● In short, increasing clock speed amplifies power consumption, heat dissipation, and CPU damage
http://fourthgradespace.weebly.com/uploads/1/3/3/9/13397069/2935717_orig.jpg
https://i.ytimg.com/vi/9S9vP2inD_U/maxresdefault.jpg
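The intuition above can be summarized with the standard CMOS dynamic power relation (general background, not from the original slides):

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V^{2} \, f
```

where \(\alpha\) is the switching activity factor, \(C\) the switched capacitance, \(V\) the supply voltage, and \(f\) the clock frequency. Since raising \(f\) in practice also requires raising \(V\), power grows faster than linearly with clock speed, which is why frequency scaling stalled.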
17. Moore's Law Is Still There, but Vendors Have Changed
● At the same clock speed, two 0.5-square-inch processors consume about as much power as a single 1-square-inch processor
(the total distance electrons travel per clock is similar)
● Vendors therefore now prefer to ship multi-core processors
http://happierhuman.wpengine.netdna-cdn.com/wp-content/uploads/2012/11/One-cookie-vs-two-cookies.jpg
18. Parallelism Is Not Free
● A multi-core system cannot help a zero-concurrency program
● Just increasing concurrency does not guarantee proportional speedup; clumsy concurrency control can make things even worse on multi-core
● Go has made important design choices for highly scalable concurrency control; the remainder of this talk describes some of them
https://img.devrant.io/devrant/rant/r_373632_a3SmV.jpg
20. Resource Sharing and Context
● Concurrent tasks share processors and memory
(the number of tasks is usually larger than the number of processors)
● To pause and resume an execution, the task's context must be managed
○ Context here means: the pointer to the next instruction, stack frames, data in registers, ...
https://headguruteacher.files.wordpress.com/2017/05/x20142711071202qitokro-s8uda-pagespeed-ic-afnisfpvf0.jpg?w=640
21. Process: Analogous to a Room for Lease
● Abstraction of an execution of a given program
● Process context switching requires many expensive operations
○ Finding the process to run next; managing waiting / pending processes
○ Backing up all current CPU registers; restoring all CPU registers from the next process's last backup
○ Flushing the virtual memory mapping cache (TLB)
○ All of the above must run in the operating system kernel, which means a context switch between user mode and kernel mode
https://www.youtube.com/watch?v=4OclkGRLuxw
26. Thread: a.k.a. Light-Weight Process
● Threads are similar to processes, but they share an address space
● Because of the shared address space, a thread's context is smaller than a process's; threads are faster to create and to switch than processes
● Still, context switch overhead remains
https://www.topdraw.com/assets/uploads/2015/04/standing-desk.jpg
27. Goroutine
● Not thread, not coroutine, goroutine.
● Major primitive of Go for concurrent task execution
● Designed to carry only minimal context overhead
http://edinburghopendata.info/wp-content/uploads/2015/05/141107-hackathon_18_d893499f2c13fe1fa05bd46252246b1e.jpg
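As a minimal illustration (not from the original slides), this is what launching concurrent work with the `go` statement and a channel looks like:

```go
package main

import "fmt"

// sum splits the work of adding 1..n across two goroutines launched
// with the `go` statement, Go's primitive for concurrent execution,
// and collects the partial results over a channel.
func sum(n int) int {
	partial := make(chan int)
	half := n / 2
	go func() { // first half: 1..half
		s := 0
		for i := 1; i <= half; i++ {
			s += i
		}
		partial <- s
	}()
	go func() { // second half: half+1..n
		s := 0
		for i := half + 1; i <= n; i++ {
			s += i
		}
		partial <- s
	}()
	return <-partial + <-partial
}

func main() {
	fmt.Println(sum(100)) // 5050
}
```

Each `go func(){...}()` is a cheap goroutine, not a kernel thread; the channel both transfers the result and synchronizes the three goroutines.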
28. Goroutine: Co-operative Scheduling
● Cooperative scheduling minimizes context switching itself
● Goroutines context switch only in well-defined situations
○ Channel send / receive operations
○ `go` statements
○ Blocking system calls (file or network I/O)
○ Garbage collection
● If goroutines are not cooperative, starvation is possible
(https://gist.github.com/sjp38/dcdb6295e10f1cfe919b)
https://renegadeinc.com/wp-content/uploads/2016/05/RInc-Cooperation-1969.jpg
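A small sketch (mine, not from the slides) of the first switch point in the list: every send and receive on an unbuffered channel is a place where the scheduler can park one goroutine and run another.

```go
package main

import "fmt"

// pingPong bounces a counter between two goroutines over unbuffered
// channels. Every send and receive is a scheduling point: the Go
// runtime parks the blocked goroutine and runs its peer, with no
// kernel involvement.
func pingPong(rounds int) int {
	ping := make(chan int)
	pong := make(chan int)
	go func() {
		for v := range ping {
			pong <- v + 1 // receive and send: both are yield points
		}
		close(pong)
	}()
	v := 0
	for i := 0; i < rounds; i++ {
		ping <- v   // main yields here until the peer receives
		v = <-pong  // ... and here until the peer replies
	}
	close(ping)
	return v
}

func main() {
	fmt.Println(pingPong(3)) // 3
}
```

Because the switches happen at these known points, no timer interrupt or kernel scheduler is needed to interleave the two goroutines.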
34. Goroutine: Minimized Context
● For processes or threads, the kernel must back up / restore all registers because it doesn't know which registers are actually in use
● Because goroutines switch only at well-defined points, the Go compiler can emit code that knows which registers are actually in use and backs up only those at each context switch
https://i.pinimg.com/originals/c3/38/5f/c3385f909b2d2c36877f7ad02f841471.jpg http://www.cohoots.info/wp-content/uploads/2017/07/coworking-space-Co-Hoots.jpg
36. Goroutine: User-Space Scheduling
● M goroutines are multiplexed onto N kernel threads by the user-space Go runtime scheduler
● No transition between user mode and kernel mode is needed to switch goroutines
https://image.slidesharecdn.com/realtime-linux-140810101151-phpapp02/95/making-linux-do-hard-realtime-74-638.jpg?cb=1429570932
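The M:N multiplexing can be observed directly (an illustrative sketch, not from the slides): however many goroutines we launch, the runtime runs them on at most `GOMAXPROCS` kernel threads.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// countdown launches n goroutines; however large n is, the runtime
// multiplexes them onto at most GOMAXPROCS kernel threads, scheduling
// them entirely in user space. It returns how many failed to run (0).
func countdown(n int) int {
	remaining := n
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			remaining--
			mu.Unlock()
		}()
	}
	wg.Wait() // all n goroutines have run to completion
	return remaining
}

func main() {
	fmt.Println("kernel threads available (GOMAXPROCS):", runtime.GOMAXPROCS(0))
	fmt.Println("goroutines left unfinished:", countdown(10000)) // 0
}
```

Ten thousand goroutines complete on a handful of threads; no thread-per-task explosion occurs.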
37. Goroutine: Minimized Context Switch Overhead
● Minimized context switching
● Minimized context size
● No transitions between user mode and kernel mode at all
● As a result, tens of thousands of goroutines in a single process are the norm
https://github.com/ashleymcnamara/gophers/blob/master/GOPHER_SHARE.png
39. Stack
● A stack is storage for a task's call frames
○ Each call frame stores the return address, parameters, and local variables
● It must not overlap with other concurrent tasks' stacks
(Diagram: a stack frame holding parameters, the return address, and local variables, bounded by the frame pointer and the stack pointer; addresses run from high to low and the stack grows downward)
40. Stack Management of Threads
● Threads allocate a fixed-size stack when created
● The default is 2 MiB on Linux/x86-32; with the pthreads NPTL implementation, the stack size can be specified at thread creation time
● Too large a stack size limits the number of concurrent threads
http://docs.roguewave.com/legacy-hpp/thrug/images/stackallocation.gif
43. Stack Management of Goroutines
● The compiler knows how much stack a given function requires
● A goroutine starts with a very small stack
● Just before a function call, Go checks whether the current stack can accommodate the function's stack requirement; if not, it grows the stack
● The stack can be shrunk, too
● As a result, each goroutine keeps only the stack it needs, allowing the maximum number of concurrent goroutines

func f() {
    g()
}

go func() {
    f()
}()

Compiler: f() requires 1 KiB of stack; g() requires 1.5 KiB of stack
The goroutine starts with a 2 KiB stack
f() will use 1 KiB; the current stack (2 KiB free) is enough
g() will use 1.5 KiB; the current stack (1 KiB free) is not enough; allocate a bigger stack!
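The growth check can be demonstrated with deep recursion (a sketch of my own; the exact initial stack size is a runtime detail, roughly 2 KiB in current Go releases):

```go
package main

import "fmt"

// depth recurses n times; the accumulated frames overflow the tiny
// initial goroutine stack long before n reaches 100000, but the
// runtime grows the stack transparently at the function-call check
// described above, so the recursion simply succeeds.
func depth(n int) int {
	var pad [128]byte // enlarge each frame so growth triggers early
	pad[0] = byte(n % 2)
	if n == 0 {
		return 0
	}
	_ = pad
	return 1 + depth(n-1)
}

func main() {
	done := make(chan int)
	// run in a fresh goroutine so it starts from the small initial stack
	go func() { done <- depth(100000) }()
	fmt.Println(<-done) // 100000
}
```

A C thread with a fixed stack would have to reserve the worst case up front; here the ~12 MiB of frames is allocated only because this particular goroutine needed it.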
49. C10K Problem
● How to hold 10,000 concurrent sessions?
● 10,000 threads for 10,000 sessions would incur high overhead
● An event loop usually results in complex callback-spaghetti code
https://www.youtube.com/watch?v=SgjAv1TnS5k
50. Integrated Poller: Goroutine Allocation
● Allocate 10,000 goroutines for 10,000 concurrent sessions; don't worry, goroutine creation is fast enough, and tens of thousands of goroutines in a single process are the norm
● Goroutines waiting for events are simply scheduled out; the Go scheduler does not grow the number of threads under the hood, because most goroutines stay scheduled out while waiting for slow event completion
https://github.com/ashleymcnamara/gophers/blob/master/GOPHER_MIC_DROP.png https://github.com/ashleymcnamara/gophers/blob/master/DRAWING_GOPHER.png
51. Integrated Poller: Polling and Scheduling
● The Go runtime uses select / kqueue / epoll / IOCP to learn which socket is ready, instead of letting each goroutine wait on its own socket
● Since the runtime knows which goroutine is waiting for each socket, it puts that goroutine back on the same CPU as soon as the socket is ready
● In short, waiting for events and waking the appropriate goroutine is delegated to the Go runtime
● As a result, gophers enjoy a simple programming model with appropriately small context management overhead
https://talks.golang.org/2012/waza.slide#22
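A goroutine-per-connection server shows the model (a minimal sketch of my own, not the slides' code; the poller is invisible to the programmer because it sits behind ordinary blocking reads):

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

// echoServer listens on a loopback port and serves every connection
// in its own goroutine. A blocking Read does not pin a kernel thread:
// the runtime's integrated poller (epoll / kqueue / IOCP underneath)
// parks the goroutine and wakes it when the socket is ready.
func echoServer() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) { // one cheap goroutine per session
				defer c.Close()
				r := bufio.NewReader(c)
				for {
					line, err := r.ReadString('\n') // parks via the poller
					if err != nil {
						return
					}
					c.Write([]byte(line))
				}
			}(conn)
		}
	}()
	return ln.Addr().String(), nil
}

// roundTrip dials the server, sends one line, and returns the echo.
func roundTrip(addr, msg string) (string, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	fmt.Fprintf(conn, "%s\n", msg)
	return bufio.NewReader(conn).ReadString('\n')
}

func main() {
	addr, err := echoServer()
	if err != nil {
		panic(err)
	}
	reply, err := roundTrip(addr, "hello")
	if err != nil {
		panic(err)
	}
	fmt.Print(reply)
}
```

The per-connection code reads like straight-line blocking I/O (no callbacks), yet idle connections cost only a parked goroutine, not a blocked kernel thread.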
52. Conclusion
● Go is special on multi-core systems owing to its clever design choices
● Goroutines make context management super cheap and fast
● Dynamic stack management of goroutines allows more concurrency
● Go's integrated poller gives gophers the benefits of both threads and event loops
https://github.com/ashleymcnamara/gophers/blob/master/GOPHER_LEARN.png