Multicore

Multi core programming
Mark Veltzer
mark@veltzer.net

Introduction
● Most modern computers and even cell phones
have more than one physical device that can run
code.
● These devices, whether you call them cores or
CPUs can operate at the same time.
● The idea was to speed up processing when
making the transistor smaller became too
difficult.
● Add more cores and you are done, right?!?
Wrong...

Adding CPUs is not everything...
● There are resources which are shared between CPUs and
as you add CPUs these resources become bottle necks and
start giving you headaches...
● This means that there is a very real limit to the number of
CPUs you can just add.
● In addition each CPU is very expensive in terms of power
consumption and other internal infrastructure (caches etc)
● This makes you want to share this infrastructure between
CPUs.
● In comes multi-core.

Multi Core vs Multi CPU
● the terms multi core and multi CPU are highly
abused and used with various meaning.
● We will use the following meanings:
● A multi CPU machine is a machine with more
than one execution unit where little is shared
between CPUs.
● A multi core machine is a machine where more
is shared between CPUs.
● This definition is not very precise.

There is always sharing
● CPUs always share somethings...
● Main memory is a good example – if both CPUs
try to access main memory at the same time
one of them is delayed a little.
● Multi core CPUs share more than just main
memory: various caches, parts of the execution
pipeline, interrupt controller and more.
● What exactly is shared depends on the
architecture.

Why should I care about Multi core?
● The main reason is performance.
● If two CPUs that you are running on share stuff
then they could, at some conditions, cause
delays one for the other.
● This could be costly.
● Knowing what could be delayed would mean
that you can program differently and avoid the
delay.

Types of sharing we don't care
about
● There are other types of sharing we don't care
about like voltage controllers, physical sockets
and more.
● These do not cause any delays of significance.
● In general the more sharing the more mutual
influence the CPUs have one on the other and
the more potential delays could be created, the
more room we have for optimisation.

How do I know how many cores I
have?
● read the manual of your mother board?!?
● $ cat /proc/cpuinfo | grep processor
● number of configured processors: $ getconf
_NPROCESSORS_CONF
● number of online processors: $ getconf
_NPROCESSORS_ONLN
● $ cat /sys/devices/system/cpu/
{present,offline,online}
● Programatically: sysconf(3)

Number of cores is not constant
● Linux supports hot plugging of CPUs on various
architectures.
● Most user space programmers don't care too
much about this.
● Adding a CPU is not a big problem.
● Removing a CPU works by moving load from a
running core to other cores.
● Of course, you can't remove the last core.

What is the maximum number of
CPUs I can run on?
● Sometimes you need access to this information in order to hold
data per CPU.
● you can ask how many processors are currently online but new
processors may be added at runtime.
● the operating system kernel is compiled to a maximum number
of CPUs in it's data structures. You can use that number in your
code as well.
● you can get that from the kernel headers package.
● or you can read /sys/devices/system/cpu/kernel_max

How multicore vs multi CPU effects
programming...
● Multicore CPUs will see memory changes
earlier.
● Multicore will slow each other more since they
share resources.
● Memory communication between multicore
processes will be faster
● In general our intuition about multi – CPU
machines is more akin to multi CPU machines
than to multi core machines.

Multicore and Multi-CPU are not
mutually exclusive...
● In todays architectures it is common to find a
single system which has both multi core and
multi CPU.
● A single system could have 4 cores, packed as
a 2x2 multi-core system.
● This means that the two pairs of CPUs share
little while each pair shares a lot.
● You need to know about all of this – this is
called the CPU topology.

Does anyone look at the CPU
topology?
● Yes, the OS kernel knows about it and uses this
information for better scheduling and other
uses.
● You can use the information from:
/sys/devices/system/cpu/cpu[n]/topology/*
● You can do that at compile time to produce
various versions for various topologies, using a
config file per topology or even dynamically at
run time...

A little bit about timing...
● In order to appreciate performance benefits we need to
know a little about timings. The following are only ball
park figures...
● Single machine instruction without memory access or
that only accesses the cache: nanoseconds...
● Single machine instruction which accesses main
memory: hundreds of nanoseconds
● Single OS system call: microsecond
● Read data from physical hard drive: miliseconds

What are pipe lines in the CPU?
● A clock in a CPU is simply a oscillator crystal at
a constant rate that synchronizes the operation
of all the parts of the CPU.
● On a steady clock CPU the clock also has to be
longer than the longest executing instruction
(bummer!)
● This means that it is difficult to increase the
clock rate. Even if most instructions are very
short the longest one determines the clock rate.

RISC
● One solution was to simpify instructions: break
down complicated instructions into simple ones.
● Now the longest instruction is much shorter and
the CPU clock could be increased.
● But now you need to fetch more instructions and
have a bigger instruction cache.
● Compilers also work harder because even simple
operations require generation of several machine
instructions - and object files become bigger.

Pipe lines
●
A different solution was to keep instructions complex but break
them down inside the cpu.
● Each instruction will be broken up to several stages.
● This way several instructions could be executed at the same time
because every stage will be executed separately
● Clock rate can now be increased to match the time it takes the
most complicated stage of the most complicated instruction to
execute.
● Compilers don't have to work hard.
● Because increasing the clock became hard engineers decided on
a pipe line approach to increasing the CPU speed.

How pipelines effect programming?
● the CPU misses branch prediction as a result of
the pipelining process.
● we need to tell the CPU how likely/unlikey our
branch is using hints to the compiler.
● code which is very un regular (noisy and hard
to predict) will be slower because it will confuse
the branch prediction in the CPU.
● See example...

What are cache lines?
● when the CPU fetches memory it does not fetch it
one address at a time.
● instead it fetches a full "cache line".
● typical cache lines are 16 to 128 bytes and are
usually aligned.
● for example, on current intel processors cache
lines are 64 bytes.
● Cache line should not be confused with memory
page which is a much bigger construct (4096 bytes
on Intel).

How to get the cache line?
● On the command line:
● $ getconf LEVEL1_DCACHE_LINESIZE
● $ cat /proc/cpuinfo | grep "clflush size"
● $ cat
/sys/devices/system/cpu/cpu0/cache/index0/co
herency_line_size
● Or use sysconf(3).
● Or hardcode in you build process.

How do you pad a structure to have
the size rounded to cache line size?
● Do it at compile time (of course).
● Add a char[n] member to your class/struct/...
● Or use special compiler padding syntax.
● Sometimes, this is counter productive when two
of these structures could fit into a single cache
line for instance. In these cases laying out
memory of the entire data structure is a better
strategy.
● Dont do this without thinking about it!

How pipeline organization in a core
effects programming?
● Example: A CPU has two pipelines: one for
integer computation and one for floating point.
● Then it is better to have integer instructions
interleaved with floating point ones (they will be
executed concurrently).
● The problem is that this type of advice is very
hard to follow in practice.
● And the compiler is supposed to be able to do
that for you anyway...

How pipeline organization in a core
effects programming?
● The CPU rearranges the order in which it does
loads and stores as a side effect of pipeline
● The semantics of your code remain the same, the
order does not matter.
● Usually you don't care about this and should leave
this alone.
● Except in cases where you are writing a
synchronization mechanism or talking to hardware.
● In that case you would use a machine barrier to
prevent this reordering (will be explained later).

The problems of memory
● Why do CPUs need a memory cache at all? The idea is
that not all memory acts in the same speed
● There are, and always will be, better technologies for
faster memory, and old technologies for slower memory
● There is a huge monetary difference between these
types of memory.
● Why didn't they make the entire Airplane out the stuff
that makes up the black box? ...:)
● So can we use the fact that we have fast and
expensive memory as opposed to slower and less
expensive memory to speed up the computer? Yes.

Symmetric vs non symmetric
memory
● Access to main memory is not uniform for all CPUs or
cores in the system
● When the access is the same (there are various
techniques to do that) then the system is called UMA
– Uniform Memory Access.
● When the system is not symmetric (e.g. main
memory is connected to CPU0 and CPU1 has to ask
CPU0 for access) then the system is called NUMA –
Non Uniform Memory Access.
● There are other types of NUMA systems as well
(more on that later)

The CPU cache
●
A relatively small area of memory within the CPU that is made out
of the fast expensive memory and which stores a copy of some
addresses accessed by the CPU.
●
The algorithm that chooses which addresses to store in the cache
is varied across different CPUs.
●
some store the last addresses accessed, some store the most
heavily used ones.
● The idea is to reduce the AVERAGE time it take to access a given
memory address.
● Rests upon the assumption that software is not chaotic and tends
to reuse the same addresses most of the time.
●
If you access memory randomly you will break this assumption
and will thrash the cash (discussed later).

Types of caches
● In reality CPUs have more than one type of
cache:
● DCACHE: data cache used for data access
(read and write)
● ICACHE: instruction cache (used to cache
execute only memory which is your code)
● TLB: Translation Lookaside Buffer. Speeds up
virtual to physical address translation.

Number of caches
● In reality CPUs usually have more than one
cache.
● As you get away from the core caches become
bigger and cache lines become bigger.
● But at the same time access becomes slower.
● Until we reach main memory which is biggest
and slowest.
● These are called L1, L2, L3, ...

Cache thrashing...
●
thrashing: using many resources to do little work. bad.
●
A piece of software allocates huge amount of ram and then accesses
address by random accessing each address one time.
● A piece of software at any given point in time with so much ram that the
same address is never accessed twice.
●
Two threads or processes that share a cache. Each one would run without
cache misses but both produce lots of cache misses.
●
In all of these cases the cache is used (it is in hardware after all) but it is not
useful.
● The cache is only useful if you have low cache miss rates.

Coherent vs non coherent platforms
● Coherency is a very important subject.
● A perfectly coherent platform is one where all
components of the system (IO devices, all cores and
CPUS) see memory in exactly the same way.
● Coherency is a hardware feature which can be
controlled by software (special machine instructions).
● Most real systems are incoherent in some ways but not
in others.
● Some systems could be used in both coherent and
incoherent ways.

Coherency - Example
● Intel is said to be a cache coherent platform.
●
This means that two processors see the same memory.
●
If one processor puts data in cache then the other processor will
respect that and will fetch the data from that cache.
●
This slows down the both CPUs a little.
● But it makes programming a whole lot easier.
● But the same platform (Intel) - is not coherent between CPU and
peripherals.
●
A special instruction needs to be used to make sure that CPU sees
data put into main memory by peripheral and the reverse

Coherency and memory reordering
● Coherency has nothing to do with memory
reordering.
● A platform can be coherent but still have
memory reordering (Intel is an example). A
platform can also be non-coherent and have no
memory reordering (older RISC systems).
● In theory different instructions should deal with
memory reordering and coherency but in
practice one instruction usually does both.

Optimizing CPU cache utilization
● First find the CPU cache line size
● Then define the most common structures to fit
inside a cache line.
● Another idea is to split commonly used data from
uncommonly used data. This will make sure that
the commonly used data will fit into one cache line.
● You can pad your structures to a cache line size
using compiler hints.
● See example....

CPU counters
● modern CPUs hold counters for all of the above
● TLB misses, data cache misses, instruction cache
misses, branch prediction misses, instructions
executed and much much more
● some of the counters are only accessible from
kernel mode and most from any mode.
● Writing to them is usually restricted to kernel
mode.
● in order to access them you need to write
assembly code which is non portable.

CPU counters - problems
● the problem is counting them right since these
counters are not per thread or process because
the CPU doesn't know about threads or
processes.
● If you are the only process running then no
problem (although you are counting the OS as
well).
● if not then better solutions are required

CPU counters – best practices
● Writing your own code in assembly is hard.
Most people don't do that.
● on Linux use perf
● use a library like PAPI to give you the CPU
counters involved.
● see demo and example...

What is NUMA?
● On very large system using a single main memory
bank becomes the main bottle neck.
● On such systems the memory is divided into banks
where each bank is attached a set of CPUs.
● Example: 4 memory banks of 4GB each where
each bank is attached a Dual Core processor.
● The banks are connected via a bus to allow each
processor to access any address.
● Access to the local bank is, of course, much faster
than to access a remote bank.

NUMA on Linux
● The Linux kernel knows about NUMA and reads
the NUMA topology at boot time.
● It schedules processes on processors according
to memory affinity.
● It tries to allocate memory for a process on the
memory bank closest to that process.
● Once a thread is running Linux will try very hard
not to migrate it to a core that is on a different
bank since that will cause large memory latency.

NUMA syscalls on Linux
● There are various system calls on Linux that allow
a programmer to control it's properties in relation to
NUMA.
● mbind allows a programmer to bind memory to a
NUMA nodes.
● migrate_pages and move_pages move pages from
bank to bank.
● set_mempolicy and get_mempolicy allow control of
the kernel policy regarding NUMA.
● /proc/[pid]/numa_maps shows the NUMA mapping
for a certain process.

Memory policies in programming
● When there is a chance of two threads on two cores
sharing the same cache line reduce it by moving
memory apart or by padding structures.
● When the reverse is true and a flat area of memory is
accessed only by one thread on one core try to make
memory as small as possible.
● When you have frequently used memory next to non
frequency used memory move them apart (redesign of
your data structures may be required)
● Measure, measure, measure
● See example....

Concepts and definitions
● In order to deal with programming for multi-core
or even time sharing machines we have to use
various tools and concepts.
● We will now go over these concepts.
● Some of these are low level mechanisms
supplied either by the OS or libraries.
● Some are high level concepts which set a
framework for the programmer to think in.

Atomic operations
● A set of CPU instructions intended to aid low
level programmers in writing synchronization
primitives.
● These instructions can be used from user
space so you can use them too.
● Either to gain performance benefits in various
conditions.
● Or in order to write synchronization primitives of
your own.

Atomic operations (cont)
● Usually the compiler vendor wraps these primitives
for you.
● You can still write assembly code if you want.
● In GCC and Intel compiler see documentation for
__sync_*
● New C++ 11 APIs also exist in some compilers
(__atomic_*).
● Intel threading building blocks has a nice atomic<T>
template.
● These operations are fast (nanosecond time).

Atomic ops : what do they do?
● Guarantee that other CPUs and HW do not see wrong
intermediate “junk” values.
● Guarantee that the CPU caches for this variable will be
synchronized (only for platforms that do not synchronize
the caches).
● Guarantee that the HW will see the change.
● Prevent reordering of instructions around the atomic
operation.
● Prevent race of the same op by two or more cores.
● See example...

Mutex
● definition: an OS/library based construct
enabling two pieces of code to avoid entering a
predefined piece(s) of code at the same time.
● How are mutexes implemented?
● The simplest way is a bit inside the OS kernel
backed up by a test-and-set CPU machine
instruction.
● This means that in order to lock the mutex you
have to go to the OS (microsecond range).

The problems of mutexes...
● A good design is one in which mutexes are
unlocked most of the time
● Otherwise these mutexes serialize all cores.
● This means that we waste the trip to the OS
kernel because we don't need the OS to put us
to sleep or to wake us up.
● Can't we just lock the mutex from user space?

Enter futex / critical section
● A futex is just like a mutex with the exception
that the state of the lock is shared by all
concurrent processes in user space.
● The test-and-set is executed in user space and
only if it fails (the lock is locked) then we go to
the OS to put the current thread to sleep.
● The reverse applies when leaving the critical
section: if no one is waiting the trip to the OS
kernel is avoided.
● The problem with this idea is robustness.

Futex robustness
● What happens if a process dies holding a
futex?
● In a multi-threaded program the whole program
dies (a thread cannot die) so we don't care.
● In a multi-process program we must release all
locks held at time of death.
● This is referred to as “futex robustness”.
● In Linux all futexes are robust.

Semaphores
● Semaphore is a very overloaded concept.
● Sometimes it is used in place of mutex.
● Sometimes it is used with a meaning of
“number of permits”
● Sometimes it is used as mutex which has a
name and therefor could be used with multiple
processes as opposed to multi-threading.
● Linux has two kinds of these semaphores: sysv
and POSIX.

Readers/Writer lock
● A resource protected by a lock is usually
protected because a write and read at the same
time would create problems.
● Multiple reads are not a problem and do not
need a lock at all.
● If we have many readers compared to writes
we could gain performance by allowing many
readers into the critical section and not allowing
writers.
● A writer would enter the critical section alone

Readers/Writer lock
● If we only have writers then we serialize all of
them and therefore gain no benefit from this
type of lock.
● How is a readers/writer lock implemented?
Using a regular type of lock.
● In Linux it can even be shared between
processes.
● See: pthread_rwlock_*

Spin lock
● We are talking about user space spin locks here (do not
confuse with kernel space spin locks).
● Is a lock that spins – busy waits – until a test-and-set
operation succeeds.
● This is efficient when the operation waited upon is very short
compared to the trip to the OS to lock the structure.
● On the other hand if the lock is not contended then it has the
same speed as a futex.
● In the contended case it may be problematic if the two
contending thread are on the same core. That is why they are
usually used for threads running on different cores.
● See pthread_spin_* on Linux...

Lock free algorithms
● Are a class of algorithms that require no locks
during their operation.
● They usually require atomic operations instead.
● Examples are a lock free circular buffer, lock
free hash table and more.
● Sometimes these algorithms rely on spinning for
a short while.
● Usually quite a pain to implement correctly.
● But you can use a library instead...

Dead lock
● A situation in which no work is done because everyone
is waiting on a lock and so no one is releasing any
lock...
● Example: A takes “a” lock, B takes “b” lock, A waits for
“b” lock, B waits for “a” lock.
● Solutions: code carefully or always acquire locks in the
same order.
● There are all kinds of dead lock detectors available
that have overhead but can be used at development
time and removed in production.
● Dead lock is called dead because all threads are
sleeping when it occurs.

Live lock
● A situation where no work is getting done
although all threads are live.
● This happens when all threads yield on to the
other instead of acquiring the lock.
● … and no thread actually acquires the lock
● Since most code doesn't yield this almost never
happens in practice.
● See sched_yield(2) in Linux...

Memory sharing
● Having one piece of memory accessible by
multiple running entities (threads or processes).
● Multiple threads share all memory by default.
● Multiple processes share no memory by
default.
● In Linux there are two types of shared memory:
SYSV IPC and POSIX.
● And libraries like ACE and BOOST to wrap
them...

Memory sharing (cont)
● If you share data in the shared memory then
you have to either:
● Have a piece of data per process and then you
don't need to protect it.
● Have multiple processes access the same data and
then protect it via all the mechanisms describe
before
● Use a specialized lock like readers/writer lock.
● Use a specialized framework like RCU.
● Use a lock free algorithm.

Memory sharing (cont)
● In Linux most synchronization primitives can be
placed into shared memory.
● This means you can use a futex, spinlock,
reader/writers lock or an atomic variable to
synchronize processes.
● That is the reason that futexes in Linux were
implemented as robust futexes.

Message queues
● A mechanism by which two processes or
threads could pass messages one to the other.
● If used for multi-processing then handled by the
OS.
● If used between processes it is an OS
managed resource, is limited and has efficiency
issues since messages are copied from source
process to OS and then back from OS to target
process.

Message queues (cont...)
● In a multi-threaded environment message
queues are used to pass pointers to messages
around and more as a synchronization
mechanism.
● This is because all threads share memory.
● ACE and BOOST both implement queues.

Barriers
● A synchronization primitive that has a number
associated with it.
● When a thread arrives at a barrier it goes to sleep.
● When enough threads arrive at the barrier (the number
specified at the barrier creation) the barrier is “broken”
and all the threads continue to run.
● Typical use: real time application time slice work.
● Linux: see pthread_barrier_*

Condition
● A synchronization primitive enabling one thread to sleep
waiting for a software defined condition to occur.
● Another thread may change the software state and signal
sleeping threads to check the condition.
● A condition has a mutex associated with it that must be
grabbed before using it.
● In Linux: see pthread_cond_*
● There are ACE and BOOST wrappers.
● See example...

RCU
● RCU: Read/Copy/Update
● RCU is framework for thinking about accessing
a shared resource.
● The idea is not to block either readers or writers
in accessing the resource.
● Instead, writers get their own copy.
● The new copy gets substituted atomically.
● Readers who started reading the old copy
continue to read the old copy.

RCU (continued)
● New readers will read the new copy.
● The problem with RCU is the costly copy
operation for writers.
● Usually implemented in heavy systems where
the cost of the read or write operation is heavy
and outweighs the cost of the copy.
● A good example is a database system.

Machine memory barrier
● Is a specialized machine instruction that tells the CPU you are running
on not to reorder reads or writes around it.
● In theory there are all kinds of barriers: read, write, read and write, io
and CPU, only CPU, etc.
● In practice every hardware platform implements only a subset of those.
● In user space it is rarely used because authors of synchronization
mechanisms use them and the end programmer just uses the
synchronization mechanisms.
● In user space Linux: __sync_synchronize() is a full machine barrier
along with all other atomic operations.
● In kernel (or in writing device drivers in user space) it is used heavily to
communicate correctly with hardware (mb(),rmb(),wmb())

Compiler barrier
● Is a compile time instruction to the compiler
(much like #pragma) that tells the compiler not
to assume values of memory around the
instruction.
● See example...
● It is better to use a compiler barrier than the
volatile keyword for writing synchronization
primitives.

Volatile considered harmful
● Because it doesn't allow caching, even within a
critical section.
● Because it disables use of CPU caches on
various platforms (not on Intel).
● Basically because it is an overkill.
● Better use the compiler barrier mentioned
before...

References
● Multicore Application Programming, For
Windows, Linux, and Oracle Solaris. Darryl
Gove.
● The Art of Multiprocessor Programming,
Maurice Herlihy and Nir Shavit.
● What Every Programmer Should Know About
Memory, Ulrich Drepper,
http://www.akkadia.org/drepper/cpumemory.pdf
● The Paul E. McKenney book on RCU.

Synchronization mechanisms
● Nothing
● Atomics
● Lockless Algos/DS
● RCU/COW
● Readers/Writer lock
● Spinlock
● Futex
● Mutex
● Semaphore

Multicore

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (7)

Similar to Multicore

Similar to Multicore (20)

More from Mark Veltzer

More from Mark Veltzer (11)

Recently uploaded

Recently uploaded (20)

Multicore