3. Micro optimizations
● This presentation is about using the C++ language
effectively.
● This means that we will not be talking about how long
other parts, which are not your code, are taking.
● Always remember that for a lot of systems the real bottle
neck is not within your code but in some other place (too
much IO, heavily loaded database, network,...).
● That is why the types of advice in this presentation are
sometimes called micro optimizations
4. What we will not be discussing
● There are two very important optimizations that could
be considered more important than micro
optimizations that we will not be discussing:
● Algorithmic optimizations.
● Architectural optimizations.
● These could bring tremendous improvements to your
code but have less to do with the technicalities of the
C++ language.
● In reality implementing them does require knowledge
of the C++ language but not at the micro optimization
level.
5. Reasons against micro
optimizations
● Some people consider micro optimizations an evil.
● They claim that since bottle necks are not in your code then you
shouldn't pay attention to these types of optimizations since:
● The ease of implementing micro optimizations gives you the feeling
that you improved the software even though the bottle neck is really
somewhere else and in-fact you just complicated your code with no
benefit which means your situation actually got worse.
● They make your code harder to maintain and read.
● They require you to learn something which is not important.
6. Reasons against micro
optimizations (cont)
● Even if your bottle neck is within your code you are better
off putting effort into either algorithmic or architectural
(including multi-core) changes.
● Although these are harder to implement the claim is that
per time spent they give better returns.
● Micro optimizations distract you from doing architectural
and algorithmic changes which are more important.
● Micro optimizations hide the algorithm or architecture in a
sea of details which makes understanding and modifying
the software difficult.
7. Reasons against micro
optimizations – less agility
● Some claim that micro optimizations make your
software less agile.
● Because they make each piece of code more
difficult they make the entire software harder to
change at the architectural level.
● This means that the software making process
will be slower to respond to new requirements.
● This could have business implications as well
as technical ones.
8. Reasons against micro
optimizations - micro optimizations
as premature optimization
● “We should forget about small efficiencies, say
about 97% of the time: premature optimization
is the root of all evil" – Donald Knuth
● There is a connection between micro
optimizations and premature optimization
because in a lot of cases micro optimizations
are done prematurely.
● So when first coding your system – Keep It
Simple Stupid – which means avoiding
unnecessary micro optimizations.
9. Reasons for micro optimization
● The main point: You need to write your code anyway
– so you may as well use the right features and get
better performance while you are at it.
● If the language has two features and you use the
one which is giving you better optimization then you
gain more efficiency without complicating your
code.
● Sometimes you are CPU bound and micro
optimizations could squeeze more performance out
of an existing code base.
10. Reasons for micro optimization
● Sometimes you are willing to live with the downsides
of micro optimization (both technical and business).
● Sometimes, just sometimes, your design is good
and you think that it may stick around for a while and
micro optimizations just make it better.
● In these cases the loss of agility is not a big problem
as it seems.
● This happens in mature products more than in young
ones.
11. Reasons for micro optimization
(cont)
● Algorithmic and Architectural changes are just
too difficult.
● Architectural changes sometimes means huge
rewrites which companies rarely can handle
because of organizational issues.
● Algorithmic changes require lots of theoretical
knowledge and sometimes low level details
(turning a lock using algorithm into lock free
requires understanding memory and machine
barriers...)
12. Micro optimizations
● If you are coding in C++ then you, albeit maybe
unconsciously, agree that too much low level
optimization is evil.
● Otherwise you would be coding in C or
assembly. Or maybe designing a dedicate
piece of hardware?
● Some people think so highly against micro
optimizations that they go for higher level
languages like Java/C#/Python/Ruby which
prevent these types of optimizations by design.
13. Micro optimizations and the
compiler
● The compiler does lots of optimizations which
could be classified as micro optimizations:
● In-lining.
● Loop unrolling.
● Constant folding.
● Dead code elimination.
● Common subexpression elimination.
● Loop invariants.
● And many more...
14. Micro optimizations and the
compiler
● This means that the micro optimizations we are
talking about here are optimizations that the
compiler cannot perform.
● In practice, however, it is not always simple to
know which optimizations are performed by the
compiler.
● Plus compilers tend to become better with time.
● And you may need to support several compilers
that have a different feature set.
15. Micro optimizations - conclusions
● Should you use micro optimizations? Yes.
● How many of them? Depends on your product (is
it mature, is it's design going to change etc).
● What types of optimizations should you prefer?
Ones that don't make your code harder to read
and understand and ones that are not already
performed by the compiler.
● When to avoid them at all? When your product is
young and you are not sure about the design at
all. Then they are just a waste of your time.
17. Measuring efficiency
● There are many ways to measure efficiency.
● The various methods may give somewhat
different results.
● They are also not equally easy to use.
● They sometimes are in conflict one with the
other.
● For instance: you may increase latency and
decrease average response time.
● And all measures depend on the dataset used.
18. Measuring efficiency
● Measurement methods include:
● Measuring number of instructions in assembly
code.
● Measuring number of instructions issued at run time
using CPU counters.
● Measuring time elapsed using CPU counters.
● Measuring time elapsed using hardware clock.
● Measuring time elapsed using OS clock.
● Full profiling in software or hardware.
19. Measuring CPU cycles
● Is one way to measure efficiency.
● In RISC chips this is more accurate since every
instruction is executed in just one cycle.
● In CISC chips things are more complicated.
One instruction may be over in one clock while
others may last many clocks.
● And instructions are sometimes halted to wait
for external factors: main memory, other CPU
atomic operation, etc...
20. What does measuring CPU cycles
mean – assembly programmers?
● That even an assembly programmer writing
down instruction explicitly cannot know how
long his/her code will run.
● This is contrary to common opinion that
“assembly programmers know exactly how
long their code runs”.
● It is true that assembly programmers have a
firmer grasp on how long their code runs.
21. C complications – function call
● A function call in C may involve many cycles as
it needs to push the arguments, then jump to
the address and then the inner function needs
to get the arguments from the stack, do the
calculation, put the result onto the stack and
then the outer function will get the result from
the stack.
● There are quite a few machine instructions
involved in all of these.
22. C complications – function call
(cont)
● On the other hand if the inner function is inside
a header file and the compiler is an optimizing
one it could eliminate the function call
altogether.
● And if the arguments are constant the compiler
may eliminate the calculation too and just hard
code the result.
● So, a function call may range anywhere
between one instruction and hundreds or more
even for a simple function.
23. C complications – switch statement
● A single switch statement in C can be realized
by a compiler in several ways:
● A series of compare-and-branch statements.
● An array of jump locations.
● A perfect hash of jump locations.
● Actually good compilers use all of these
combined when optimizing.
● So how long does a “switch” statement take? I
donno!
24. C complications - registers
● In C the programmer may advise the compiler about
using a register for a variable using the register
keyword.
● The programmer, however, does not know if the
compiler actually manages to store the variable in a
register and does not get feedback about it (aside
from disassembly, of course).
● The difference in performance is great as access to
register and to memory (even cache memory) is
about order of magnitude in difference.
25. C complications – cache lines
● When programmers start programming in C
they are told nothing about cache lines, cache
misses etc.
● After a while they are told that structures
accessed heavily should be small as to fit in a
cache line so as to hit the CPU cache and not
main memory.
● And after a while more they are told about how
synchronization between CPUs works.
26. What does all of this mean? - c
programmers
● That even when programming in C the
programmer only has vague idea of how his/her
program is translated into machine code.
● The programmer has to try and wrap his/her
head around various low level concepts
(registers, switches, caches) in order to
understand the speed of his/her program.
● So analyzing speed is not simple, even in C.
27. What about C++, and higher level?
● In C++ things get even nastier.
● Various basic building blocks like constructor,
destructor, virtual method call, exception, copy
constructor are even harder to estimate.
● So the feel of the programmer for the runtime
gets more vague.
● And in Java the programmer doesn't even know
about machine instructions...:)
28. Criticisms of C++ performance (low
level)
● Low level people criticize C++ for being too
slow.
● As a result they say they'd use assembly or C
over C++.
● The claim is that C++ does too much behind
the programmers back
● This prevents the programmer from getting a
good grasp of what is happening performance
wise.
29. Criticisms of C++ performance (high
level)
● High level people criticize C++ for being too
slow.
● They claim that C++ forces the programmer to
focus too much on low level details.
● This make the design suffer and be suboptimal.
● Which in turn leads to bad performance.
30. Is C++ performance bad?
● No.
● It really depends on the programmer.
● And on the programmers knowledge of what
C++ does behind the scenes.
● If the programmer knows when and what C++
does behind his/her back then he/she knows
how to handle the situation.
● This is exactly what this presentation is for.
31. Is C++ performance bad? (cont)
● Since C++ is a more complicated language it can
be abused by programmers who do not know it's
features in a more creative manner and create
systems that perform more poorly than systems
implemented in lower level languages (C,
assembly).
● But C++ is usually thrown at big problems where
development in C is more difficult and requires
object oriented approaches. In these cases C++
abuse is less of an issue since C abuse in 5large
systems is an issue also.
32. Is C++ performance bad? (cont)
● So C++ sometimes performs slower and sometimes
faster than equivalent systems implemented in lower
level languages.
● If the system you are implementing is a large one
then using object oriented and C++ has an
advantage over C since it is more agile and will
allow you to reach a good design with less rewrites.
● Since rewriting is problematic in a corporate
environment C++ could actually outperform C in
such environments.
33. So what should we do?
● Remember the example of the C function call
and the C switch statement?
● C++ has many more of these (virtual functions,
exceptions, constructors, destructions,
templates and more).
● We should study the language to understand
exactly when and what the language does.
● And understand how to use the features
effectively.
34. Comparison of effectiveness issues
in C++ vs other languages
● Other languages, like Java, also have effectiveness issues.
● The difference is at the level at which the programmer needs
to peek in order to understand the issues involved.
● In Java or other higher level languages effectiveness books,
courses and presentations deal with higher level
considerations than in C++.
● Issue such as CPU cache lines never appear...:)
● Some see this as a weakness of C++.
35. C++ and over design
● Over design is a situation where a programmer
creates too much infrastructure in a piece of
software, much of which is never used.
● Over design is problematic because it makes the
code larger and more complicated and thus making it
harder to understand, maintain and develop further.
● Over design also makes your software run slower for
no good reason.
● Over design is a notorious problem in object oriented
languages and is usually inflicted by beginner
programmers.
36. C++ and over design (cont)
● Over design rests on the assumption that you
must make your software flexible for future
requirements now, regardless if that flexibility
will be used in the future.
● Experience shows that programmers make
their program flexible where it is convenient for
them to do so and not where future
requirements will demand that flexibility.
● This means that over design is rarely a good
thing.
37. C++ and over design (cont)
● Mature and young products are different in
respect to over design and young products tend
to exhibit this illness more.
● C++ has potential for over design.
● Since over design means lower performance
then C++ has potential for lower performance in
that respect.
● C has less over design issues and assembly
even less because they are harder to code.
38. C++ and over design (cont)
● All you need to do to solve the problem is
simply avoid over design.
● This is very close to the concept of agile
development.
● Agile development tries to reconcile the two
following statements:
● You don't know the requirements of a piece of
software in advance.
● You want to avoid over design.
39. C++ and over design (cont)
● Instead agile development puts an emphasis on
the development process:
● Keep code simple.
● Keep the programming environment simple.
● Make it as easy as possible to change and test the
code.
● This makes the adjustment to new features easy
without over design.
● So one solution to the problem of C++ and over
design is agile development.
41. Reasons for efficiency
● Even if hardware advances competition
between software vendors remain. Since they
use the same hardware the winner will be the
one with more software efficiency.
● Communication speed has exceeded CPU
speed and so the software is the new
bottleneck. Due to faster optical and other
communication solutions it will stay that way.
42. Reasons for efficiency (cont)
● CPU speed improvement has slowed down and
given way to multi-core/multi-CPU designs.
● This is due to various low level physical
problems which are not likely to be lifted in the
near future.
● But multi-core/multi-CPU solutions are based
on software that is specifically written in order
to utilize such systems and so software
efficiency is once again the bottle neck.
43. Reasons for efficiency (cont)
● Last but not least is the act of writing software.
● A simple error in programming could cause a
piece of software to run 100 times slower than it
could.
● Example: NTPL in Linux made thread creation
a 1000 times faster.
● Software is still key to efficiency.
45. What type of efficiency do we
mean?
● Binary size: we don't care about that much
these days (data much larger than code).
● Compile time: Most people don't care about this
but maybe you should?
● Data size: yes, but C++ data structures tend to
be quite efficient and RAM is cheap these days.
● Code efficiency: This is the one which most
issues relate to.
46. What type of efficiency do we
mean?
● Code efficiency is usually split into:
● Latency/response time/responsiveness
● Throughput.
● These goals are a trade off: increasing one will
tend to decrease the other.
● You have to decide where you are in balance
between these two opposing objectives.
● Being somewhere in the middle is fine as long
as you define clearly exactly where you are.
48. Fast path: definition
● Most software has a lot of code which is only used, if ever,
one time or rarely.
● This includes setup, tear down, error handing and more.
● More than 90% of code is like that (sometimes a lot more).
● This is the “boring” code or the “slow path”.
● Fast path is the “non boring” code that actually gets used
most of the time.
● One function can include both slow path and fast path code
(e.g. it gets used often but has lots of error handling code).
49. Fast path optimizations
● Micro optimizations, and any other
optimizations for that matter, only make sense if
you actually use them a lot
● This means that the fast path should receive
most of the love and care.
● Optimizing code outside of the fast path will not
provide better performance and will just
complicate your code making it harder to
read/understand/modify/adjust to new
requirements.
51. How is a class implemented?
● A class is just like a struct with compile time
checking of access (private, public, protected).
● It is packed like a struct and you can tell the
compiler the offsets of members (although you
rarely have to).
● The methods are not within the class. They are
nowhere to be found in the class. The compiler
arranges for the right object to be passed using
compile time type checking. Virtual methods
are a different matter.
52. How is a class implemented? (cont)
● The minimum size of a class is 1 byte.
● This is to make sure that memory really is
allocated.
● sizeof() cannot also return 0.
● Think of sizeof(array)/sizeof(instance) if
sizeof(instance)==0...
53. How is inheritance implemented?
● Inherited classes are just like structs layed out
one after the other.
● So if B is derived from A then sizeof
(B)=sizeof(A)+sizeof(fields of B).
● Different compiler vendors can pack A before B
or vice versa. This is an ABI issue.
● Different compiler vendors can decide on
different alignment between the A and the B
part of the object. This is an ABI issue.
54. Inheriting from empty classes
● An empty class takes 1 byte.
● But if you define a class derived from it and
adding an integer you will find that the size of
the derived class is 4 and not 5.
● This is because the compiler knows that the 1
byte in the empty class really doesn't refer to
data.
● So inheriting from many empty classes doesn't
make your class bigger.
55. How is multi-inheritance
implemented?
● If C is derived from A and B then C has it's own
fields, A's fields and B's fields (unless it's virtual
inheritance).
● The physical layout of these fields is again an
ABI issue that is implemented differently by
different compiler vendors.
56. Virtual methods
● If a class has at least one virtual method then
the size of the object grows by exactly one
pointer.
● That pointer points to a structure called a
vtable.
● There is exactly one of these in RAM for each
class.
● The table holds the method pointers for all
virtual methods in that class.
57. Virtual methods (cont)
● The compiler vendors could have held pointers
of more vtables inside a vtable but most of
them choose to create one flattened vtable per
class.
● The order of the functions within the table is
compiler dependent ABI issue. It could be order
of declaration, alphabetical order, whatever (as
long as it is consistent).
● Each call to a virtual method goes through this
table.
58. Virtual methods (cont)
● You can think of the vtable pointer as the type
of the instance the difference being that classes
without virtual methods don't have this “type”
information attached to them.
● This is not the same as RTTI which adds info
pointed to by the vtables.
59. Virtual methods (cont)
● This means that adding the first virtual method to a class
adds a pointer to the size of the instance.
● Adding a virtual method to a parent class will add that
pointer to all of it's decedents.
● Virtual methods are one dereference slower to call to.
● And they cannot be in lined unless the compiler can
determine the exact type which is rarely the case.
● Any constructor sets up the pointer to the vtable which
makes the constructor slower.
60. RTTI
● The ability of the system to report on the
dynamic type of an object and to provide
information about that type at runtime (as
opposed to at compile time)
● When utilized consistently can be a powerful
tool to ease the work of the programmer in
managing resources.
● Comes at the cost of a little performance.
● Includes typeid and dynamic_cast
61. RTTI - typeid
● typeid – A function that can be given a type and
returns std::typeinfo which describes the type.
● You can use this return value for the name of the
type (.name()) or to compare type information
from two instances.
● According to the manual the name of the
instance is implementation (ABI) specific.
● Furthermore, typeid is only guaranteed to work
on instances which belong to classes that have
at vtable (meaning at least one virtual method).
62. RTTI – typeid (cont)
● In order to work with typeid you must include
<typeinfo>.
● Typeid sometimes works at compile time and
sometimes leaves code that understands the
type at runtime.
● Most compilers allow you to work without typeid
and get a little bit more performance (in terms
of code and size of vtables).
● In gcc this is the -fno-rtti flag.
63. RTTI - dynamic_cast
● Allows you to try and cast an object to some
type at runtime.
● Because it is done at runtime it may fail at
runtime and your code should include a check
for this.
● Works by traversing the vtables of the objects
so it is slow and depends on the number of
classes in your hierarchy.
64. RTTI – should you use it?
● Good design usually means not needing to use RTTI
which is also good from a performance standpoint.
● typeid returns name() as a string and strings are slow
to compare so never use that to determine what to do
with an instance.
● dynamic_cast is slow (as explained) and thus should
not be used in the fast path.
● Inheritance with virtual methods is pretty fast and
should be used to distinguish between types instead.
65. Virtual destructor
● Is required if you:
● Want to destroy the class using delete.
● At the delete point the pointer may point to a derived
element.
● Different classes in your hierarchy would like to do different
things in the destructor.
● This is not the same as having virtual methods.
● You may have virtual methods and not need a virtual
destructor.
● You may have no virtual methods and need a virtual
destructor.
67. In-lining as C++ efficiency
● C++ efficiency is based on compile time optimization.
● But the compiler cannot optimize what it cannot see.
● So a lot of patterns in C++ optimization are based on letting the
compiler see more of your code.
● This makes the compiler see through a lot of the abstractions that
we need as developers and still do only the things that need to get
done.
● We maintain a layered software which is good for maintenance while
the compiler does the right thing. Good deal.
68. Problems of in-lining
● The problem of in-lining is that you need to let the
compiler see more of your code.
● This means you have to write more in h files.
● Which makes the compilation longer.
● And modifications to implementation now causes
long project recompiles.
● This is similar to the problems of templates in C++.
● Only use when you really need the performance
boost.
69. Advantages of in-lining
● Avoiding yet another function call. Function call is not
cheap since it uses the stack RAM.
● Avoiding storing registers into ram before the
functions call and collecting them after the call. Since
RAM access is long compared to register access
then this is a very good speedup.
● Better allocation of registers since the allocation is
combined between the inner and outer functions.
● Allow the compiler to do dead code elimination based
on constants passed to the inner function.
70. Advantages of in-lining
● Allows the compiler to reorder instructions
between the outer and inner functions and so
utilize the hardware better.
● Less code executed and so less instruction
cache misses.
● Allows the compiler to do loop invariant
optimization if the inner function is called from a
loop in the outer.
71. Advantages of in-lining
● Allows the compiler to know whether pointers
passed to the inner function are really changed
by it. If they are not (as in many cases) the
compiler can avoid load and stores to that
address. This is even in the absence of correct
const statements.
● A function call is an optimization barrier in the
outer function and it is now removed.
72. In-lining - conclusions
● In-lining provides many, many advantages to C++
programmers.
● Obviously you have to compile with optimization to
get those advantages.
● And write code in header files.
● When you do optimize in-lining occurs even in non
header files.
● A barrier to in-lining is calling an unknown function
which usually happens when calling a virtual
function. So try to avoid those in the fast path.
74. Fast construction and destruction
● Consider the need to hold a resource which
must be released (mutex, file,...)
● A common pattern is to put this resource into a
class that releases the resource upon
destruction.
● But if this pattern is used in the fast path then
we must make sure that constructors and
destructors will not hurt our performance.
75. Fast construction and destruction
● Do not allocate on the heap – do it on the stack.
● If you do use the stack then there is no “delete
with unknown type” and so less need for a
virtual destructor.
● Try to avoid the need for a virtual destructor in
order to avoid the need to go through the virtual
table and the function call.
● This does not mean you cannot use
inheritance.
76. Fast construction and destruction
● Avoid virtual functions in the object in order to
avoid the extra virtual pointer assignment in the
constructor.
● In line the class by putting it into a header file.
● This will make sure that the construction and
destruction will not trigger a function call which is
expensive.
● Try to avoid composition in this case as it may
create even more work which involves the heap.
77. Fast construction and destruction
● Remember that constructors construct all
members if not explicitly initialized using
member construction syntax.
● And their parents...
● And so do destructors.
● Try to keep fast classes as simple as possible.
78. Lazy construction
● If some fast paths do not require everything that
you do inside the constructor then avoid calling
that work in the constructor.
● Example could be some inner member
composition initialization.
● Instead only construct them lazily when
needed.
● This requires careful coding that makes sure
that the member is not used unless initialized.
79. Redundant construction
● When ever an object is composed of other
objects and the constructor is triggered it
creates the other objects using their default
constructors.
● Only then is the constructor run.
● This means that if you initialize the member
only within your construction you are paying for
redundant construction.
● Instead initialize with Class::Class():attr(args)
80. Avoid creation of temporary
variables
● If you have an assignment of two types and an
exact assignment operator is missing then it
may be fulfilled by constructing a temporary out
of the right hand side and then calling the
assignment operator.
● This may be expensive.
● Try to make sure you have the right assignment
operators in place to avoid these costly
constructions of temporaries.
81. Virtual functions cost
● The assignment of the vptr.
● The extra dereference in method calls.
● The inability of the compiler to in-line the
function.
● The cost of the first two seems small. The cost
of the third is unknown and may be large
depending on the circumstances.
82. Virtual functions cost – when should
you care?
● The inability to in-line is indeed problematic but
remember that the function you did not inline
could be in-lined internally.
● This means that what you really care about is
the call/return cost+more in-lining that is
possible out of seeing the two pieces of code
together.
● If your function is long then you may not care
about this at all – so you should care only if
your function should be very fast.
83. Virtual functions cost – solution 1
● Avoid the virtual function at all.
● You may not really need virtual on it (are you
over designing?)
● Write two classes instead of one (do you really
want to mix them or are you just using one in
any particular run?)
● Use a macro to decide which to use in each
run?!?
84. Virtual functions cost – solution 2
● Pass your class to the using class as a
template parameter.
● Templates are more performance friendly than
inheritance hierarchies.
● They push type resolution from runtime to
compile time (which we consider to be free).
● If the template class is in-line ready then it will
be in-lined in the using class.
86. References
● Efficient C++ Performance Programming
Techniques – Dov Bulka, David Mayhew.
● Effective C++ - Third Edition – 55 Specific Ways
to Improve Your Programs and Designs – Scott
Meyers.
● More Effective C++ - 35 New Ways to Improve
Your Programs and Designs – Scott Meyers.
● The C++ FAQ: http://www.parashift.com/c++-
faq-lite/