This document summarizes key design choices that let the Go programming language achieve high scalability. It discusses how Go uses lightweight goroutines instead of operating system threads to minimize context-switching overhead: goroutines switch context only in well-defined situations and have very small, dynamically sized stacks for efficient memory usage. It also covers how Go schedules goroutines cooperatively in user space, avoiding kernel transitions, and how the runtime's integrated poller combines the simplicity of threads with the efficiency of an event loop.
1. Design Choices of Golang for High Scalability
SeongJae Park <sj38.park@gmail.com>
2. This work by SeongJae Park is licensed under the Creative
Commons Attribution-ShareAlike 3.0 Unported License. To
view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
3. These slides were presented during
GDG Seoul Meetup 201709
(https://www.meetup.com/GDG-Seoul/events/242054608/)
4. Nice To Meet You
SeongJae Park
sj38.park@gmail.com
Part-time Linux kernel programmer at KOSSLAB
5. What Makes Golang So Special on Multicore?
● People say Go is a good choice for high performance and scalability
● Why is scalability so important?
● Why are existing solutions insufficient?
● What makes Go so special for these problems?
● TL;DR: Goroutines, dynamic stack management, and the integrated poller
DISCLAIMER: This talk is based on Dave Cheney's OSCON15 presentation
(http://cdn.oreillystatic.com/en/assets/1/event/129/High%20performance%20servers%20without%20the%20event%20loop%20Presentation.pdf)
8. Moore's Law
● Law: The number of transistors per square inch doubles roughly every 18 months
● CPU vendors used the law to increase CPU clock speed; the only thing programmers needed for better performance was patience, and the speedup came as a free lunch
● However, CPU clock speed stopped increasing more than a decade ago
https://www.karlrupp.net/wp-content/uploads/2015/06/35years.png
(Chart: 35 years of microprocessor trend data; number of transistors, single-thread performance, clock speed, power in watts, and number of cores)
11. Why No Clock Speed?
● Electrons move between transistors on every clock tick
(clock speed is analogous to the switch on/off speed in the circuit diagram below)
● Moving anything requires energy; here we use electrical energy
● Some of that electrical energy leaks during the transformation to kinetic energy and becomes heat; the temperature rises
● High temperature damages the CPU
● In short, increasing clock speed amplifies power consumption, heat dissipation, and CPU damage
http://fourthgradespace.weebly.com/uploads/1/3/3/9/13397069/2935717_orig.jpg
https://i.ytimg.com/vi/9S9vP2inD_U/maxresdefault.jpg
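The intuition above can be summarized with the standard CMOS dynamic power relation (general background, not from the original slides):

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V^{2} \, f
```

where \(\alpha\) is the switching activity factor, \(C\) the switched capacitance, \(V\) the supply voltage, and \(f\) the clock frequency. Since raising \(f\) in practice also requires raising \(V\), power grows faster than linearly with clock speed, which is why frequency scaling stalled.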
17. Moore's Law Is Still There, but Vendors Have Changed
● At the same clock speed, two 0.5-square-inch processors consume about as much power as a single 1-square-inch processor
(the total distance electrons travel per clock is similar)
● Vendors therefore now prefer to ship multi-core processors
http://happierhuman.wpengine.netdna-cdn.com/wp-content/uploads/2012/11/One-cookie-vs-two-cookies.jpg
18. Parallelism Is Not Free
● A multi-core system cannot help a zero-concurrency program
● Just increasing concurrency does not guarantee proportional speedup; clumsy concurrency control can make things even worse on multi-core
● Go has made important design choices for highly scalable concurrency control; the remainder of this talk describes some of them
https://img.devrant.io/devrant/rant/r_373632_a3SmV.jpg
20. Resource Sharing and Context
● Concurrent tasks share processors and memory
(the number of tasks is usually larger than the number of processors)
● To pause and resume an execution, the task's context must be managed
○ Context here means: the pointer to the next instruction, stack frames, data in registers, ...
https://headguruteacher.files.wordpress.com/2017/05/x20142711071202qitokro-s8uda-pagespeed-ic-afnisfpvf0.jpg?w=640
21. Process: Analogous to a Room for Lease
● Abstraction of an execution of a given program
● Process context switching requires many expensive operations
○ Finding the process to run next; managing waiting / pending processes
○ Backing up all current CPU registers; restoring all CPU registers from the next process's last backup
○ Flushing the virtual memory mapping cache (TLB)
○ All of the above must run in the operating system kernel, which means a context switch between user mode and kernel mode
https://www.youtube.com/watch?v=4OclkGRLuxw
26. Thread: a.k.a. Light-Weight Process
● Threads are similar to processes, but they share an address space
● Because of the shared address space, a thread's context is smaller than a process's; threads are faster to create and to switch than processes
● Still, context switch overhead remains
https://www.topdraw.com/assets/uploads/2015/04/standing-desk.jpg
27. Goroutine
● Not thread, not coroutine, goroutine.
● Major primitive of Go for concurrent task execution
● Designed to carry only minimal context overhead
http://edinburghopendata.info/wp-content/uploads/2015/05/141107-hackathon_18_d893499f2c13fe1fa05bd46252246b1e.jpg
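As a minimal illustration (not from the original slides), this is what launching concurrent work with the `go` statement and a channel looks like:

```go
package main

import "fmt"

// sum splits the work of adding 1..n across two goroutines launched
// with the `go` statement, Go's primitive for concurrent execution,
// and collects the partial results over a channel.
func sum(n int) int {
	partial := make(chan int)
	half := n / 2
	go func() { // first half: 1..half
		s := 0
		for i := 1; i <= half; i++ {
			s += i
		}
		partial <- s
	}()
	go func() { // second half: half+1..n
		s := 0
		for i := half + 1; i <= n; i++ {
			s += i
		}
		partial <- s
	}()
	return <-partial + <-partial
}

func main() {
	fmt.Println(sum(100)) // 5050
}
```

Each `go func(){...}()` is a cheap goroutine, not a kernel thread; the channel both transfers the result and synchronizes the three goroutines.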
28. Goroutine: Co-operative Scheduling
● Cooperative scheduling minimizes context switching itself
● Goroutines context switch only in well-defined situations
○ Channel send / receive operations
○ `go` statements
○ Blocking system calls (file or network I/O)
○ Garbage collection
● If goroutines are not cooperative, starvation is possible
(https://gist.github.com/sjp38/dcdb6295e10f1cfe919b)
https://renegadeinc.com/wp-content/uploads/2016/05/RInc-Cooperation-1969.jpg
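A small sketch (mine, not from the slides) of the first switch point in the list: every send and receive on an unbuffered channel is a place where the scheduler can park one goroutine and run another.

```go
package main

import "fmt"

// pingPong bounces a counter between two goroutines over unbuffered
// channels. Every send and receive is a scheduling point: the Go
// runtime parks the blocked goroutine and runs its peer, with no
// kernel involvement.
func pingPong(rounds int) int {
	ping := make(chan int)
	pong := make(chan int)
	go func() {
		for v := range ping {
			pong <- v + 1 // receive and send: both are yield points
		}
		close(pong)
	}()
	v := 0
	for i := 0; i < rounds; i++ {
		ping <- v   // main yields here until the peer receives
		v = <-pong  // ... and here until the peer replies
	}
	close(ping)
	return v
}

func main() {
	fmt.Println(pingPong(3)) // 3
}
```

Because the switches happen at these known points, no timer interrupt or kernel scheduler is needed to interleave the two goroutines.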
34. Goroutine: Minimized Context
● For processes or threads, the kernel must back up / restore all registers because it doesn't know which registers are actually in use
● Because goroutines switch only at well-defined points, the Go compiler can emit code that knows which registers are actually in use and backs up only those at each context switch
https://i.pinimg.com/originals/c3/38/5f/c3385f909b2d2c36877f7ad02f841471.jpg http://www.cohoots.info/wp-content/uploads/2017/07/coworking-space-Co-Hoots.jpg
36. Goroutine: User-Space Scheduling
● M goroutines are multiplexed onto N kernel threads by the user-space Go runtime scheduler
● No transition between user mode and kernel mode is needed to switch goroutines
https://image.slidesharecdn.com/realtime-linux-140810101151-phpapp02/95/making-linux-do-hard-realtime-74-638.jpg?cb=1429570932
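The M:N multiplexing can be observed directly (an illustrative sketch, not from the slides): however many goroutines we launch, the runtime runs them on at most `GOMAXPROCS` kernel threads.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// countdown launches n goroutines; however large n is, the runtime
// multiplexes them onto at most GOMAXPROCS kernel threads, scheduling
// them entirely in user space. It returns how many failed to run (0).
func countdown(n int) int {
	remaining := n
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			remaining--
			mu.Unlock()
		}()
	}
	wg.Wait() // all n goroutines have run to completion
	return remaining
}

func main() {
	fmt.Println("kernel threads available (GOMAXPROCS):", runtime.GOMAXPROCS(0))
	fmt.Println("goroutines left unfinished:", countdown(10000)) // 0
}
```

Ten thousand goroutines complete on a handful of threads; no thread-per-task explosion occurs.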
37. Goroutine: Minimized Context Switch Overhead
● Minimized context switching
● Minimized context size
● No transitions between user mode and kernel mode at all
● As a result, tens of thousands of goroutines in a single process are the norm
https://github.com/ashleymcnamara/gophers/blob/master/GOPHER_SHARE.png
39. Stack
● A stack is storage for a task's call frames
○ Each call frame stores the return address, parameters, and local variables
● It must not overlap with other concurrent tasks' stacks
(Diagram: a stack frame holding parameters, the return address, and local variables, bounded by the frame pointer and the stack pointer; addresses run from high to low and the stack grows downward)
40. Stack Management of Threads
● Threads allocate a fixed-size stack when created
● The default is 2 MiB on Linux/x86-32; with the pthreads NPTL implementation, the stack size can be specified at thread creation time
● Too large a stack size limits the number of concurrent threads
http://docs.roguewave.com/legacy-hpp/thrug/images/stackallocation.gif
43. Stack Management of Goroutines
● The compiler knows how much stack a given function requires
● A goroutine starts with a very small stack
● Just before a function call, Go checks whether the current stack can accommodate the function's stack requirement; if not, it grows the stack
● The stack can be shrunk, too
● As a result, each goroutine keeps only the stack it needs, allowing the maximum number of concurrent goroutines

func f() {
    g()
}

go func() {
    f()
}()

Compiler: f() requires 1 KiB of stack; g() requires 1.5 KiB of stack
The goroutine starts with a 2 KiB stack
f() will use 1 KiB; the current stack (2 KiB free) is enough
g() will use 1.5 KiB; the current stack (1 KiB free) is not enough; allocate a bigger stack!
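The growth check can be demonstrated with deep recursion (a sketch of my own; the exact initial stack size is a runtime detail, roughly 2 KiB in current Go releases):

```go
package main

import "fmt"

// depth recurses n times; the accumulated frames overflow the tiny
// initial goroutine stack long before n reaches 100000, but the
// runtime grows the stack transparently at the function-call check
// described above, so the recursion simply succeeds.
func depth(n int) int {
	var pad [128]byte // enlarge each frame so growth triggers early
	pad[0] = byte(n % 2)
	if n == 0 {
		return 0
	}
	_ = pad
	return 1 + depth(n-1)
}

func main() {
	done := make(chan int)
	// run in a fresh goroutine so it starts from the small initial stack
	go func() { done <- depth(100000) }()
	fmt.Println(<-done) // 100000
}
```

A C thread with a fixed stack would have to reserve the worst case up front; here the ~12 MiB of frames is allocated only because this particular goroutine needed it.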
49. C10K Problem
● How to hold 10,000 concurrent sessions?
● 10,000 threads for 10,000 sessions would incur high overhead
● An event loop usually results in complex callback-spaghetti code
https://www.youtube.com/watch?v=SgjAv1TnS5k
50. Integrated Poller: Goroutine Allocation
● Allocate 10,000 goroutines for 10,000 concurrent sessions; don't worry, goroutine creation is fast enough, and tens of thousands of goroutines in a single process are the norm
● Goroutines waiting for events are simply scheduled out; the Go scheduler does not grow the number of threads under the hood, because most goroutines stay scheduled out while waiting for slow event completion
https://github.com/ashleymcnamara/gophers/blob/master/GOPHER_MIC_DROP.png https://github.com/ashleymcnamara/gophers/blob/master/DRAWING_GOPHER.png
51. Integrated Poller: Polling and Scheduling
● The Go runtime uses select / kqueue / epoll / IOCP to learn which socket is ready, instead of letting each goroutine wait on its own socket
● Since the runtime knows which goroutine is waiting for each socket, it puts that goroutine back on the same CPU as soon as the socket is ready
● In short, waiting for events and waking the appropriate goroutine is delegated to the Go runtime
● As a result, gophers enjoy a simple programming model with appropriately small context management overhead
https://talks.golang.org/2012/waza.slide#22
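A goroutine-per-connection server shows the model (a minimal sketch of my own, not the slides' code; the poller is invisible to the programmer because it sits behind ordinary blocking reads):

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

// echoServer listens on a loopback port and serves every connection
// in its own goroutine. A blocking Read does not pin a kernel thread:
// the runtime's integrated poller (epoll / kqueue / IOCP underneath)
// parks the goroutine and wakes it when the socket is ready.
func echoServer() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) { // one cheap goroutine per session
				defer c.Close()
				r := bufio.NewReader(c)
				for {
					line, err := r.ReadString('\n') // parks via the poller
					if err != nil {
						return
					}
					c.Write([]byte(line))
				}
			}(conn)
		}
	}()
	return ln.Addr().String(), nil
}

// roundTrip dials the server, sends one line, and returns the echo.
func roundTrip(addr, msg string) (string, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	fmt.Fprintf(conn, "%s\n", msg)
	return bufio.NewReader(conn).ReadString('\n')
}

func main() {
	addr, err := echoServer()
	if err != nil {
		panic(err)
	}
	reply, err := roundTrip(addr, "hello")
	if err != nil {
		panic(err)
	}
	fmt.Print(reply)
}
```

The per-connection code reads like straight-line blocking I/O (no callbacks), yet idle connections cost only a parked goroutine, not a blocked kernel thread.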
52. Conclusion
● Go is special on multi-core systems owing to its clever design choices
● Goroutines make context management super cheap and fast
● Dynamic stack management of goroutines allows more concurrency
● Go's integrated poller gives gophers the benefits of both threads and event loops
https://github.com/ashleymcnamara/gophers/blob/master/GOPHER_LEARN.png